Randomized clinical trials. Design, practice and reporting [2 ed.] 9781119524656, 1119524652, 9781119524670, 1119524679



Randomised Clinical Trials

Randomised Clinical Trials Design, Practice and Reporting

Second Edition

David Machin
Leicester Cancer Research Centre, University of Leicester, Leicester, UK
Medical Statistics Group, School of Health and Related Research, University of Sheffield, Sheffield, UK

Peter M. Fayers
Institute of Applied Health, University of Aberdeen, Scotland, UK

Bee Choo Tai
Saw Swee Hock School of Public Health, National University of Singapore and National University Health System Singapore
Singapore, Singapore

This edition first published 2021, © 2021 John Wiley & Sons Ltd Edition History John Wiley & Sons, Ltd (1e, 2010) All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions. The right of David Machin, Peter M. Fayers, and Bee Choo Tai to be identified as the authors of this work has been asserted in accordance with law. Registered Office(s) John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK Editorial Office 9600 Garsington Road, Oxford, OX4 2DQ, UK For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats. Limit of Liability/Disclaimer of Warranty The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. 
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Library of Congress Cataloging-in-Publication Data Names: Machin, David, 1939- author. | Fayers, Peter M., author. | Tai, Bee Choo, author. Title: Randomized clinical trials : design, practice and reporting / David Machin, Peter M. Fayers, Bee Choo Tai. Description: Second edition. | Hoboken, NJ : Wiley-Blackwell, 2021. | Includes bibliographical references and index. 
Identifiers: LCCN 2020044539 (print) | LCCN 2020044540 (ebook) | ISBN 9781119524649 (paperback) | ISBN 9781119524656 (adobe pdf) | ISBN 9781119524670 (epub) Subjects: MESH: Randomized Controlled Trials as Topic–methods | Biomedical Research–methods | Research Design | Data Interpretation, Statistical Classification: LCC R853.C55 (print) | LCC R853.C55 (ebook) | NLM W 20.55.C5 | DDC 610.72/4–dc23 LC record available at https://lccn.loc.gov/2020044539 LC ebook record available at https://lccn.loc.gov/2020044540 Cover Design: Wiley Cover Image: © Kenishirotie/Shutterstock Set in 10.5/12.5pt Minion by SPi Global, Pondicherry, India


To Lorna Christine Machin Tessa and Emma Fayers Isaac Xu-En and Kheng-Chuan Koh

Contents

Preface

Part I Basic Considerations

1 Introduction
  1.1 Introduction
  1.2 Some completed trials
  1.3 Choice of design
  1.4 Practical constraints
  1.5 Influencing clinical practice
  1.6 History
  1.7 How do trials arise?
  1.8 Ethical considerations
  1.9 Regulatory requirements
  1.10 Focus
  1.11 Further reading

2 Design Features
  2.1 Introduction
  2.2 The research question
  2.3 Patient selection
  2.4 The consent process
  2.5 Choice of interventions
  2.6 Choice of design
  2.7 Assigning the interventions
  2.8 Making the assessments
  2.9 Analysis and reporting
  2.10 Technical details
  2.11 Guidelines
  2.12 Further reading

3 The Trial Protocol
  3.1 Introduction
  3.2 Abstract
  3.3 Background
  3.4 Research objectives
  3.5 Design
  3.6 Intervention details
  3.7 Eligibility
  3.8 Randomisation
  3.9 Assessment and data collection
  3.10 Statistical considerations
  3.11 Ethical issues
  3.12 Organisational structure
  3.13 Publication policy
  3.14 Trial forms
  3.15 Appendices
  3.16 Regulatory requirements
  3.17 Guidelines
  3.18 Protocols

4 Measurement and Data Capture
  4.1 Introduction
  4.2 Types of measures
  4.3 Measures and endpoints
  4.4 Making the observations
  4.5 Baseline measures
  4.6 Data recording
  4.7 Technical notes
  4.8 Guidelines

5 Randomisation
  5.1 Introduction
  5.2 Rationale
  5.3 Mechanics
  5.4 Application
  5.5 Carrying out randomisation
  5.6 Documentation
  5.7 Unacceptable methods
  5.8 Guidelines

6 Trial Initiation
  6.1 Introduction
  6.2 Trial organisation
  6.3 Data collection and processing
  6.4 Internal data monitoring
  6.5 Ethical and regulatory requirements
  6.6 Launching the trial
  6.7 Trial registries
  6.8 Guidelines

7 Trial Conduct and Completion
  7.1 Introduction
  7.2 Regular feedback
  7.3 Publicity
  7.4 Protocol modifications
  7.5 Preparing the publication(s)
  7.6 The next trial?
  7.7 Protocol

8 Basics for Analysis
  8.1 Introduction
  8.2 The standard Normal distribution
  8.3 Confidence intervals
  8.4 Statistical tests
  8.5 Examples of analysis
  8.6 Regression methods
  8.7 Other issues
  8.8 Practice
  8.9 Technical details

9 Trial Size
  9.1 Introduction
  9.2 Significance level and power
  9.3 The fundamental equation
  9.4 Specific situations
  9.5 Practical considerations
  9.6 Further topics
  9.7 Guideline
  9.8 Software

10 Data and Safety Monitoring
  10.1 Introduction
  10.2 The DSMB
  10.3 Early reviews
  10.4 Interim reviews
  10.5 Protocols

11 Reporting
  11.1 Introduction
  11.2 Publication
  11.3 Responsibilities
  11.4 Background
  11.5 Methods
  11.6 Findings
  11.7 When things go wrong
  11.8 Conclusions
  11.9 Guidelines

Part II Adaptions of the Basic Design

12 More Than Two Interventions
  12.1 Introduction
  12.2 Unstructured comparisons
  12.3 Comparisons with placebo (or standard)
  12.4 Dose–response designs
  12.5 Factorial trials
  12.6 Complex structure comparisons

13 Paired and Matched Designs
  13.1 Matched-pair trials
  13.2 Cross-over trials
  13.3 Split-mouth designs
  13.4 Guidelines

14 Repeated Measures Design
  14.1 Introduction
  14.2 Simplified analysis
  14.3 Regression models
  14.4 Auto-correlation
  14.5 Accounting for auto-correlation
  14.6 The design effect (DE)
  14.7 Trial size
  14.8 Practicalities
  14.9 Reporting
  14.10 Matched organs receiving the same intervention

15 Non-Inferiority and Equivalence Trials
  15.1 Introduction
  15.2 Non-inferiority
  15.3 Analysis
  15.4 Trial size
  15.5 Equivalence
  15.6 Reporting
  15.7 Practical Issues
  15.8 Guidelines

16 Cluster Designs
  16.1 Design features
  16.2 Procedures
  16.3 Regression models
  16.4 Intra-class correlation
  16.5 Trial size
  16.6 Analysis
  16.7 Practicalities
  16.8 Reporting
  16.9 Further reading

17 Stepped Wedge Designs
  17.1 Introduction
  17.2 Notation
  17.3 Basic structure
  17.4 Randomisation
  17.5 Cross-sectional design
  17.6 Closed cohort design
  17.7 Practicalities

Part III Further Topics

18 Genomic Targets
  18.1 Introduction
  18.2 Predictive markers
  18.3 Enrichment design
  18.4 Biomarker-Stratified Designs
  18.5 Adaptive threshold designs

19 Feasibility and Pilot Studies
  19.1 Introduction
  19.2 Feasibility studies
  19.3 External-pilot studies
  19.4 Considerations across external-pilot and main trial
  19.5 Internal-pilot studies
  19.6 Other preliminary studies
  19.7 Reporting

20 Further Topics
  20.1 Introduction
  20.2 Adaptive approaches
  20.3 Large simple trials
  20.4 Bayesian methods
  20.5 Interim analyses
  20.6 Zelen randomised consent designs
  20.7 Systematic overviews

Statistical Tables

Glossary

References

Index

Preface

It is now more than 10 years since the first edition of this book was published. In the intervening years, while many things have remained unchanged, there have also been many new developments. This second edition refreshes the first, refining some of the sections to better explain their contents and, at the same time, replacing some examples with more current illustrations.

To reflect the changes, we have created new chapters by splitting and then expanding old chapters. Thus, we now include full chapters on data and safety monitoring (including interim analyses of accumulating data), cluster designs, repeated measures and non-inferiority designs, as there has been a rapid increase in the use of such trial designs along with some methodological developments and improvements in the statistical software available for analysis. In addition, we have included entirely new chapters on stepped wedge designs, genomic targets, and feasibility and pilot studies. The chapter on stepped wedge designs reflects the growing importance of such complex intervention designs while that on genomic targets highlights the research focus directed towards more individualised medicine. In contrast, the new chapter concerned with feasibility and pilot studies brings us back to the early planning stages of a clinical trial. The chapter is included as there is increasing recognition that perhaps a more structured approach is required at the planning stage of any proposed trial. The intention is to help avoid the conduct of clinical trials which fail because the basic assumptions made at the planning stage were inappropriate.

This edition is divided into three sections: I Basic Considerations, II Adaptions of the Basic Design and III Further Topics. As the title suggests, the first section is intended to cover topics that are relevant to all randomised trials of whatever design and complexity. Thus, it may be the key section for those who are new to clinical trials and an aide-mémoire for those more experienced in this area. For this purpose, it concentrates on the parallel two-group controlled trial with a single outcome measure where patients are randomised individually to one of the two interventions concerned. The second section expands on the individually randomised design in several ways by considering paired designs, repeated assessments of the (same) outcome measure over time, more than two interventions and non-inferiority trials. It also includes cluster trials and stepped wedge designs, in which groups rather than individuals are randomised to the interventions concerned. The final section deals with genomic targets, feasibility and pilot studies, and a final chapter on miscellaneous topics including adaptive designs, large simple trials and very small trials, with new additions describing alpha spending functions and the predictive probability test for use in interim analyses.

We are grateful to many colleagues, collaborators and numerous investigators who have contributed directly or indirectly to this book over many years. We thank Isaac Koh for the cover design and Leo Liu for his professional advice on this.

David Machin Peter M. Fayers Bee Choo Tai Leicester and Sheffield, Aberdeen, and Singapore

PS As we read the proofs of this book, under lockdown conditions imposed by Covid-19, results of successful randomised trials with respect to treatments for those who have contracted the disease and protective vaccines against the pandemic have been published. These include the use of dexamethasone as described by The RECOVERY Collaborative Group (2020) and the Pfizer-BioNTech mega-sized vaccine trial against Covid-19 tested by Polack, Thomas, Kitchin, et al. (2020). To overcome the challenges in conducting clinical trials as a result of lockdown and the need to minimise face-to-face contact due to the infectious nature of the coronavirus, the use of e-consent is briefly discussed in Chapter 3.

PART I

Basic Considerations

CHAPTER 1

Introduction

A very large number of clinical trials with human subjects have been conducted in a wide variety of contexts. Many of these have been concerned, for example, with improving (in some way) the management of patients with disease and others the prevention of the disease or condition in the first place. The essence of a clinical trial is the comparison of a standard strategy with an alternative (perhaps novel) intervention. The aim of this chapter is to illustrate some of the wide variety of clinical trials that have been conducted and to highlight some key features of their design, conduct and analysis.

1.1 Introduction

The aim of this book is to introduce those who are to become involved with randomised clinical trials to the wide range of challenges that are faced by those who conduct such trials. Thus, our intended readership is expected to range from healthcare professionals of all disciplines who are concerned with patient care to those more involved with the non-clinical aspects such as the statistical design, data processing and subsequent analysis of the results. We assume no prior knowledge of clinical trial processes, and we have attempted to explain the more statistical sections in as non-technical a way as possible. In a first reading of this book, these sections could be omitted. Throughout the book, we stress the collaborative nature of clinical trials activity and would hope that readers would consult their more experienced colleagues on aspects of our coverage.

The business of clinical trials is an ongoing process, and as we write, trials are currently being designed (particularly with respect to the coronavirus), opened, conducted, closed, analysed and reported, their results filtered into current practice and the next trials planned. To describe the key features of this process, it is difficult to know where to start as each stage interacts with each of the others to some extent. For example, in designing a trial the investigators need to be mindful of the eventual analysis to be undertaken as this governs (although it is only one aspect of) how large a trial should be launched. Some of the steps are intellectually challenging, for example, defining the key therapeutic question, whilst others may perhaps appear more mundane, such as defining the data forms or the data entry procedures, but all steps (whether large or small, major or minor) underpin the eventual successful outcome – the influence on clinical practice once the trial results are available. For many of these aspects of the process, whole books have been written. We can only provide an introduction to these.

Numerous terms including ‘clinical trial’ itself need to be introduced. As a consequence, we have included a Glossary of Terms, which is mainly extracted from Day (2007) Dictionary of Clinical Trials. Thus, the Glossary defines:

clinical trial: any systematic study of the effects of a treatment in human subjects.

These definitions may not be exhaustive in the sense that ‘treatment’ used here may be substituted by, for example, ‘intervention’ depending on the specific context of the clinical trial under consideration.

Clinical trials require a multidisciplinary approach in which all partners play a key role at some stage of the trial process. Furthermore, ‘Evidence-Based Medicine’ (EBM) requires critical consideration of all the available evidence about whether, for example, a treatment works, before recommending it for clinical practice. In this respect, it is therefore vital that one can clearly see that a proposed trial addresses a key question which will have a clinically meaningful outcome, is well designed, conducted and reported, and that the results are persuasive enough to change clinical practice if appropriate.

Despite perhaps not having a professional interest in the science of clinical trials, everyone has a vested interest in them as potential patients requiring care. How many of us have never been to see a doctor, had a hospital admission or taken medication? All of us may be, have been, or certainly will be, recipients of clinical trial results whether during prebirth, at birth or in childhood for vaccination and minor illness, as an adult for fertility, sports injuries, minor and major non-life-threatening or life-threatening illnesses, and in old age for care related to our mental or physical needs.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.

1.2 Some completed trials

As we have indicated, there are countless ongoing trials and many have been successfully conducted and reported. To give some indication of the range and diversity of application, we describe a selection of clinical trials that have been conducted. Their designs include some features that we also draw upon as examples in later chapters.

Example 1.1 Small parallel two-group design – gastrointestinal function

Lobo, Bostock, Neal, et al. (2002) describe a randomised trial in which 20 patients with colonic cancer either received postoperative intravenous fluids in accordance with current hospital standard practice (S) or according to a restricted intake regimen (R). A primary endpoint measure in each patient was the solid-phase gastric emptying time on the fourth postoperative day. The median emptying time was shorter with R by 56 minutes, with a 95% confidence interval (CI) from 12 to 132 minutes. The trial also included preoperative and postoperative (days 0, 1, 3 and 5) measures of the concentrations of serum albumin, haemoglobin and blood urea in a repeated measures design.

Key features include the following:
• Design: Randomised comparison of a standard and test, single-centre participation, unblinded assessment,
• Endpoint: Gastric emptying time,
• Size: 21 patients following colonic resection,
• Analysis: Mann–Whitney U-test¹ for comparing two medians,
• Conclusion: The restricted intake group had shorter delays in returning to gastrointestinal function.

¹ This can also be referred to as the Wilcoxon rank-sum test.

Example 1.2 Parallel two-group design – hepatitis B

Levie, Gjorup, Skinhøj and Stoffel (2002) compared a 2-dose regimen of recombinant hepatitis B vaccine including the immune stimulant AS04 with the standard 3-dose regimen of HBsAg in healthy adults. The rationale behind testing a 2-dose regimen was that fewer injections would improve compliance.

Key features include the following:
• Design: Two centres, open-label randomised two-group comparison,
• Endpoint: Seroprotection rate,
• Size: 340 healthy adults aged between 15 and 40 years,
• Analysis: Fisher’s exact test,
• Conclusion: The 2-dose regimen compared favourably with the standard.


Example 1.3 Unstructured three-group design – newly diagnosed type 2 diabetes

The randomised trial of Weng, Li, Xu, et al. (2008) compared, in newly diagnosed patients with type 2 diabetes, three treatments: multiple daily insulin injections (MDI), continuous subcutaneous insulin infusion (CSII) and oral hypoglycaemic agent (OHA).

Key features include the following:
• Design: Nine centres, randomised three-group comparison,
• Endpoint: Time of glycaemic remission,
• Size: 410 newly diagnosed patients with type 2 diabetes,
• Analysis: Cox proportional-hazards regression model,
• Conclusion: Early intensive therapy has favourable outcomes on recovery and maintenance of β-cell function and protracted glycaemic remission compared to OHA.

Example 1.4 Small dose–response design – pain prevention following hand surgery

Stevinson, Devaraj, Fountain-Barber, et al. (2003) conducted a randomised double-blind, placebo-controlled trial to compare placebo with homoeopathic arnica 6C and arnica 30C to determine the degree of pain prevention in patients with carpal tunnel syndrome undergoing elective surgery for their condition. Pain was assessed postoperatively with the short-form McGill Pain Questionnaire (SF-MPQ) at four days. A total of 64 patients were randomised to the three groups, resulting in median scores of 16.0 (range 0–69), 10.5 (0–76) and 15.0 (0–82) for the respective groups. From these results, the authors suggest that homoeopathic arnica has no advantage over placebo in reducing levels of postoperative pain.

Key features include the following:
• Design: Single-centre, randomised double-blind, placebo-controlled, three-group dose response,
• Endpoint: Pain using the SF-MPQ,
• Size: 64 patients undergoing hand surgery for carpal tunnel syndrome,
• Analysis: Kruskal–Wallis test,
• Conclusion: Irrespective of dose, homoeopathic arnica has no advantage over placebo.


Example 1.5 Large dose–response design – HER2-positive breast cancer

Smith, Procter, Gelber, et al. (2007) showed that 1 year of treatment with Trastuzumab (T) after adjuvant therapy in HER2-positive patients with breast cancer was superior to Observation (O) alone. They reported a hazard ratio, HR = 0.66 (95% CI 0.47 to 0.91, p-value = 0.0115) for overall survival in favour of adjuvant treatment. This comparison was from two arms of a three-arm large multicentre international randomised trial comprising 1698 patients randomised to O, 1703 to T for 1 year (T1) and 1701 to T for 2 years (T2): a total of 5102 patients.

Key features include the following:
• Design: Randomised, multicentre, observation versus active treatment,
• Size: Part of a large trial of 5102 women with HER2-positive breast cancer,
• Endpoint: Overall survival,
• Analysis: Comparison in 3404 women from the O and T1 groups using survival curves,
• Conclusion: Treatment with T1 after adjuvant chemotherapy has a significant overall survival benefit.
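As a rough consistency check on the figures quoted in Example 1.5, the standard error of log HR can be recovered from the reported 95% CI and used to form an approximate z-statistic. The sketch below assumes that the CI is symmetric on the log scale (a Wald-type interval). The trial report may well have used a different procedure, such as a log-rank test, so the p-value obtained is only expected to be close to the published value. All variable names are illustrative.

```python
from math import log
from statistics import NormalDist

# Figures quoted in Example 1.5 (Smith, Procter, Gelber, et al., 2007)
hr, ci_low, ci_high = 0.66, 0.47, 0.91

# Assumption: the 95% CI is symmetric about log(HR), i.e. a Wald-type interval
z_975 = NormalDist().inv_cdf(0.975)
se_log_hr = (log(ci_high) - log(ci_low)) / (2 * z_975)

# Approximate z-statistic and two-sided p-value for the null hypothesis HR = 1
z_stat = log(hr) / se_log_hr
p_two_sided = 2 * NormalDist().cdf(-abs(z_stat))

print(f"SE(log HR) = {se_log_hr:.3f}, z = {z_stat:.2f}, p = {p_two_sided:.4f}")
```

Under these assumptions the two-sided p-value comes out at about 0.014, which is of the same order as the published 0.0115. The small discrepancy is to be expected from the rounding of the HR and its CI to two decimal places.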

Example 1.6 Non-inferiority trial – uncomplicated falciparum malaria

Zongo, Dorsey, Rouamba, et al. (2007) conducted a randomised non-inferiority trial to test the hypothesis that the risk of recurrent parasitaemia was not significantly worse with artemether–lumefantrine (AL) than with amodiaquine plus sulfadoxine–pyrimethamine (AQ + SP). A total of 826 patients were screened, of whom 548 were found to have uncomplicated malaria and were randomised (273 to AQ + SP and 275 to AL). A primary endpoint was the risk of treatment failure within 28 days of randomisation. The authors concluded that AQ + SP, with a recurrent malaria rate of 1.7% (4/233), was more effective than AL, with a rate of 10.2% (25/245), representing a difference of 8.5% (95% CI 4.3–12.6%). These results suggest that the hypothesis of ‘non-inferiority’ should not be accepted, as the whole of the CI lies beyond the non-inferiority limit of 3% set by the investigators.

Key features include the following:
• Design: Multicentre, two-group, non-inferiority trial,
• Endpoint: Time to recurrent malaria,
• Size: Large – 548 patients with uncomplicated falciparum malaria,
• Analysis: Comparison of Kaplan–Meier survival curves,
• Conclusion: AL was less effective than (inferior to) AQ + SP.
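The difference in recurrent malaria rates quoted in Example 1.6, and its position relative to the 3% non-inferiority limit, can be reproduced approximately from the counts given. The sketch below assumes a simple Wald interval for the difference between two proportions, which is our choice for illustration and not necessarily the method used by the trial investigators. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Recurrent malaria counts quoted in Example 1.6
fail_aqsp, n_aqsp = 4, 233    # AQ + SP
fail_al, n_al = 25, 245       # AL
margin = 0.03                 # non-inferiority limit set by the investigators

p_aqsp = fail_aqsp / n_aqsp
p_al = fail_al / n_al
diff = p_al - p_aqsp          # excess risk of recurrence with AL

# Assumption: Wald interval for a difference between two independent proportions
se = sqrt(p_aqsp * (1 - p_aqsp) / n_aqsp + p_al * (1 - p_al) / n_al)
z = NormalDist().inv_cdf(0.975)
lower, upper = diff - z * se, diff + z * se

# Non-inferiority of AL would require the upper limit to fall below the margin
non_inferior = upper < margin
# The whole interval lies beyond the margin when the lower limit exceeds it
beyond_margin = lower > margin

print(f"difference = {100 * diff:.1f}% (95% CI {100 * lower:.1f}% to {100 * upper:.1f}%)")
print(f"non-inferior: {non_inferior}, CI entirely beyond margin: {beyond_margin}")
```

With these counts the interval runs from about 4.3% to 12.6%, in agreement with the figures quoted. Since even the lower limit exceeds 3%, non-inferiority of AL cannot be claimed.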


Example 1.7 Repeated measures – atopic eczema

Meggitt, Gray and Reynolds (2006) randomised 63 patients with moderate-to-severe atopic eczema to receive either Azathioprine or Placebo in a double-blind formulation to ascertain the relative reduction in disease activity determined by the six-area six-sign atopic dermatitis (SASSAD) score between the groups. One patient in each group subsequently withdrew from the trial before treatment was initiated. The investigators reported a 5.4 unit advantage with Azathioprine. In this trial, patients were randomised using a minimisation procedure, in the ratio of 2 to 1 in favour of Azathioprine in order to:

… encourage recruitment, to reduce the numbers receiving pharmacologically inactive systemic treatment, and to increase the likelihood of identifying infrequent adverse events.

Key features include the following:
• Design: Single-centre, randomised double-blind, placebo-controlled, randomised 2 : 1 allocation ratio using minimisation,
• Endpoint: SASSAD,
• Size: 63 patients with moderate-to-severe atopic eczema,
• Analysis: Comparison of mean group regression slopes over a 12-week period,
• Conclusion: Azathioprine produces a clinically relevant improvement.

Example 1.8 Cross-over trial – known or suspected hypertension

Kerley, Dolan, James and Cormican (2018) describe a randomised placebo (P) controlled, two-period cross-over trial of dietary nitrate (N) in 20 patients with known or suspected hypertension. The P and N interventions were delivered in beetroot juice in a nitrogen-depleted or nitrogen-enriched form, respectively. Thirteen of the individuals were randomised to receive the sequence NP, that is N in Period I of the trial followed by P in Period II, and the other seven were allocated the sequence PN. Amongst the many endpoints, plasma nitrate and ambulatory blood pressure were recorded prior to randomisation, then 7 days later following the start of treatment in Period I and a further 7 days following Period II. The authors concluded:

Our results support … an anti-hypertensive effect of dietary nitrate … .

Key features include the following:
• Design: Single-centre, randomised placebo-controlled, two-period cross-over trial,
• Size: 20 patients with known or suspected hypertension, unequal numbers assigned to the sequences,
• Washout: None included,
• Endpoints: Plasma nitrate and ambulatory blood pressure,
• Analysis: Complex methodology described. All statistical tests were conducted at the two-sided 0.05 significance level,
• Conclusion: Nitrogen-enriched beetroot juice has an anti-hypertensive effect.

Example 1.9 Paired design – glaucoma

Glaucoma Laser Trial Research Group (1995) recruited 271 subjects with newly diagnosed primary open-angle glaucoma, and one eye of each patient was randomly assigned to initial treatment by argon laser (L) trabeculoplasty followed by Stepped (S) medication (LS). The other eye then received the treatments in reverse order, SL. They reported on the 261 eyes and found that measures of visual field status for eyes treated by the sequence LS were slightly better than those treated by SL. The authors state:

Statistical significance was attained for only some of the differences, and the clinical implications of such small differences are not known.

Key features include the following:
• Design: Multicentre, paired design, compares alternative schedules for administering two procedures – the schedule was randomised to one eye with the other eye receiving the alternative,
• Endpoint: Visual field status,
• Size: 271 patients with primary open-angle glaucoma,
• Analysis: Comparison of means at particular time points following initiation of treatment using the paired t-test,
• Conclusion: Eyes treated with laser trabeculoplasty first were slightly better than those eyes treated with topical medication first.


Example 1.10 Split-mouth design – implants for edentulous sites

Pozzi, Agliardi, Tallarico and Barlattani (2012) conducted a trial in 34 partially edentate patients who required at least two single implant-supported crowns. A split-mouth design was used in which one of two different prosthetic interfaces and configurations, internal conical connection with back-tapered collar and platform shifting (CC) or external-hexagon implants with flat-to-flat implant-abutment interface (EH), was randomly allocated at each edentulous site. From a total of 88 implants included in the trial, the authors concluded that both implants performed similarly in terms of failure rates.

Key features include the following:
• Design: Single-centre, split-mouth, random allocation,
• Endpoint: Failure rates and marginal bone loss,
• Size: 34 patients with 88 edentulous sites,
• Analysis: Comparing implants using paired t-tests at several intervals postrandomisation,
• Conclusion: Lower marginal bone loss with CC when compared to EH.

Example 1.11

Cluster trial – hip protectors for the elderly

Meyer, Warnke, Bender and Mülhauser (2003) conducted a trial involving 942 residents from 49 nursing homes. In this cluster trial design, the nursing homes contain ‘clusters’ of residents and the homes (not the individual residents) were randomised, with 25 homes, comprising a total of 459 residents, assigned to the intervention group and 24, with 483 residents, to the usual care (control) group. The intervention comprised a single education session for nursing staff, who then educated residents, and the provision of three hip protectors per resident. The control clusters administered usual care optimised by brief information to nursing staff about hip protectors and the provision of two hip protectors per cluster for demonstration purposes. The main outcome measure was the incidence of hip fractures. There were 21 hip fractures in 21 (4.6%) residents in the intervention group and 42 in 39 (8.1%) residents in the control group – a difference of 3.5% (95% CI 0.3–7.3%, p-value = 0.072). The authors concluded: The introduction of a structured education programme and the provision of free hip protectors in nursing homes may reduce the number of hip fractures.

Key features include the following:
• Design: Multicluster, randomised,
• Size: 49 nursing homes comprising 942 residents with high risk of falling,
• Endpoint: Hip fractures,
• Analysis: Chi-squared test adjusted for cluster randomisation,
• Conclusion: A structured education programme and the provision of hip protectors may reduce the number of hip fractures.
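The proportions quoted in Example 1.11 can be checked with a few lines of code. The following is a minimal sketch only: it computes the difference in the proportion of residents with a hip fracture using a normal approximation, and then shows, crudely, how a design effect would widen the confidence interval to allow for cluster randomisation. The function name, the intracluster correlation (assumed_icc) and the design-effect formula are our illustrative assumptions; this is not the cluster-adjusted chi-squared analysis used by the trial authors and it is not expected to reproduce their interval.

```python
from math import sqrt


def risk_difference(events_1, n_1, events_2, n_2, z=1.96, design_effect=1.0):
    """Difference in proportions (group 2 minus group 1) with a
    normal-approximation confidence interval.

    A design_effect above 1 inflates the variance as a crude allowance
    for cluster randomisation.
    """
    p_1 = events_1 / n_1
    p_2 = events_2 / n_2
    diff = p_2 - p_1
    variance = p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2
    se = sqrt(design_effect * variance)
    return diff, (diff - z * se, diff + z * se)


# Residents with a hip fracture: 21 of 459 (intervention), 39 of 483 (control).
naive_diff, naive_ci = risk_difference(21, 459, 39, 483)

# Hypothetical intracluster correlation, chosen for illustration only;
# it is NOT a value reported by the trial.
assumed_icc = 0.02
mean_cluster_size = 942 / 49  # residents per nursing home, on average
deff = 1 + (mean_cluster_size - 1) * assumed_icc
adj_diff, adj_ci = risk_difference(21, 459, 39, 483, design_effect=deff)

print(f"difference in proportions: {naive_diff:.3f}")
print(f"CI ignoring clustering:    {naive_ci[0]:.3f} to {naive_ci[1]:.3f}")
print(f"CI with design effect:     {adj_ci[0]:.3f} to {adj_ci[1]:.3f}")
```

The point of the sketch is the direction of the adjustment: allowing for clustering leaves the estimated difference unchanged but makes the interval wider than an analysis that treats the 942 residents as independent.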

Example 1.12

Single arm – discrete de novo lesions in a coronary artery

Erbel, Di Mario, Bartunek, et al. (2007) describe a non-randomised multicentre trial involving 8 centres in which 63 patients were enrolled with single de novo lesions in a native coronary artery. In these patients, a total of 71 biodegradable magnesium stents were successfully implanted. The (composite) primary endpoint was the rate of major adverse cardiac events (MACE) defined as any one of: cardiac death, Q-wave myocardial infarction or target lesion revascularisation at 4 months poststent implant. This was to be compared with an anticipated rate of 30%. They reported a rate of MACE at 4 months of 15/63 (23.8%); all of which were attributed to target lesion revascularisation (there were no deaths or Q-wave myocardial infarctions) and concluded: … biodegradable magnesium stents can achieve an immediate angiographic result similar to … other metal stents … .

Nevertheless, the authors also commented in their Discussion: The absence of randomisation precludes direct comparison with other techniques of percutaneous revascularisation.

Key features include the following:
• Design: No comparison group hence non-randomised, multicentre,
• Size: 71 stents in 63 patients,
• Endpoint: Composite endpoint – MACE,
• Analysis: Proportion experiencing MACE with 95% confidence interval,
• Conclusion: Bioabsorbable stents can achieve an immediate angiographic result similar to other metal stents and can be safely degraded.
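The single-arm analysis of Example 1.12 amounts to estimating one proportion, with a confidence interval, and setting it against the anticipated rate of 30%. A minimal sketch follows. The summary above does not say which interval method the authors used, so the Wilson score interval here is our assumption, as are the function and variable names.

```python
from math import sqrt


def wilson_interval(events, n, z=1.96):
    """Wilson score confidence interval for a single proportion."""
    p = events / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


mace_rate = 15 / 63            # observed MACE rate at 4 months
low, high = wilson_interval(15, 63)
anticipated_rate = 0.30        # rate the investigators planned to compare against

print(f"MACE rate: {mace_rate:.3f}")
print(f"95% CI:    {low:.3f} to {high:.3f}")
print("anticipated rate inside interval:", low < anticipated_rate < high)
```

Whatever interval is used, the comparison is with a fixed anticipated figure rather than with concurrently randomised patients, which is why the authors themselves caution against direct comparison with other techniques.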

The above examples of successfully completed clinical trials illustrate a wide range of topics investigated. These include patients with disease (breast cancer, colon cancer, eczema, glaucoma, malaria and diabetes mellitus), those requiring coronary artery stents or hand surgery, elderly residents of nursing homes, patients aged 25 years or more requiring at least two implant-supported crowns for dental caries, healthy individuals and those requiring vaccinations. Although not included here, trials are conducted, for example, to evaluate different diagnostic procedures, different bed mattresses to reduce the incidence of bed sores, different dressings for wounds of all types and fertility regulation options for males and females of reproductive potential. These trials are often termed Phase III trials in contrast with Phase I and Phase II trials which are concerned with early stages of the (often pharmaceutical) development process.

Although the trials differ in aspects of their design, the majority have the general structure of a two (or more) group parallel design in which eligible patients are assigned to receive the alternative options (often treatments but more generally termed interventions) and then at some later time assessed in a way which will be indicative of (successful) outcome. The outcomes measured in these trials include the following: survival time, gastric emptying time, reduction in disease activity, visual field status, recurrent parasitaemia, major adverse cardiac events, pain, the number of hip fractures, systolic blood pressure and standard criteria used to assess dental restorations. In the trial of homoeopathic arnica for pain relief following hand surgery, assessment was made in a double-blind or double-masked manner in which neither the patient nor the assessor was aware of the specific treatment option actually received.
The methods used for the allocation to the options included simple randomisation of equal numbers per group, a 2 to 1 allocation, a minimisation procedure taking into account patient characteristics, and randomisation of nursing homes (clusters) rather than of individual residents. For the split-mouth design used for comparing dental implants, the authors state: For randomization of the implant type, a pregenerated random sequence was created … . Opaque envelopes were sealed according to pregenerated list. An independent judge prepared all envelopes. … an assistant indicated which implant had to be placed first following the indications contained in the sequentially number envelope.
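To make two of the allocation methods just mentioned concrete, the following sketch generates a pregenerated allocation list in two ways: by simple randomisation (in effect a coin toss for each patient) and by a 2 to 1 allocation using randomly permuted blocks of three. The use of permuted blocks to achieve the 2 to 1 ratio, the arm labels and the function names are our assumptions for illustration; the trials summarised above may have implemented their allocation differently, and minimisation is not shown.

```python
import random


def simple_randomisation(n_patients, arms=("A", "B"), seed=None):
    """Allocate each patient independently with equal probability per arm."""
    rng = random.Random(seed)
    return [rng.choice(arms) for _ in range(n_patients)]


def blocked_allocation(n_blocks, block=("A", "A", "B"), seed=None):
    """Allocate in randomly permuted blocks; the default block gives a
    2 to 1 ratio of A to B within every block of three patients."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        shuffled = list(block)
        rng.shuffle(shuffled)
        sequence.extend(shuffled)
    return sequence


simple_list = simple_randomisation(12, seed=2021)
two_to_one_list = blocked_allocation(4, seed=2021)

print("simple randomisation:", "".join(simple_list))
print("2 to 1 in blocks of 3:", "".join(two_to_one_list))
```

Note that simple randomisation gives equal numbers per group only on average, whereas the blocked list guarantees the intended ratio after every complete block. In practice the list would be prepared by someone independent of recruitment and concealed, for example in sequentially numbered opaque envelopes as in the split-mouth trial.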

The non-random allocation to a single-arm study using a new bioabsorbable stent for coronary scaffolding might now be regarded as a feasibility study although the trial results were compared to those from historical data. The trials ranged in size from 20 patients with colonic cancer to 5102 women with HER2-positive breast cancer. One trial involved 522 eyes from 271 subjects, another 88 single implant-supported crowns in 34 partially edentate patients. Although not fully detailed in the above summaries, methods of statistical analysis ranged from a simple comparison of two proportions to relatively complex methods using techniques for survival time outcomes.

In general, trials are designed to establish a difference between the (therapeutic) options under test, were one to exist. Consequently, they are sometimes termed superiority trials. However, in certain circumstances, as in the trial for the treatment of uncomplicated falciparum malaria, the research team were looking for non-inferiority implying that the two treatment strategies of AQ + SP and AL would give very similar risks of failure. In the event, the trial suggested that AL was (unacceptably) less effective than AQ + SP implying that non-inferiority was not established. Such designs usually imply that a satisfactory outcome is that the test treatment does not perform worse than the standard to an extent predefined by the investigating team. Thus, use of a non-inferiority design often implies that, although some therapeutic loss may be conceded on the main outcome variable, the new therapy will have other features (gains) to offset this. For example, if the new compound was a little less effective (not equal to) but had a better toxicity profile, then this might be sufficient to prefer it for clinical practice.
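The logic of a non-inferiority comparison can be sketched as follows. One common approach, which we assume here, is to calculate a confidence interval for the difference in failure risk (test minus standard) and to declare non-inferiority only if the upper limit lies below the predefined margin. The counts and the margin in the example calls are hypothetical and chosen purely for illustration; they are not taken from the malaria trial or from any other trial described in this chapter.

```python
from math import sqrt


def non_inferiority(fail_test, n_test, fail_std, n_std, margin, z=1.96):
    """Return (difference, upper confidence limit, non-inferiority established?).

    The difference is the failure risk on test minus that on standard, so
    positive values mean the test treatment fails more often.
    """
    p_test = fail_test / n_test
    p_std = fail_std / n_std
    diff = p_test - p_std
    se = sqrt(p_test * (1 - p_test) / n_test + p_std * (1 - p_std) / n_std)
    upper = diff + z * se
    return diff, upper, upper < margin


# Hypothetical counts and a hypothetical margin of 5 percentage points.
same_risk = non_inferiority(10, 400, 10, 400, margin=0.05)
small_trial = non_inferiority(12, 200, 10, 200, margin=0.05)
clearly_worse = non_inferiority(30, 200, 10, 200, margin=0.05)

for label, result in [("same risk, n = 400 per arm", same_risk),
                      ("similar risk, n = 200 per arm", small_trial),
                      ("test clearly worse", clearly_worse)]:
    diff, upper, established = result
    print(f"{label}: diff = {diff:.3f}, upper limit = {upper:.3f}, "
          f"non-inferior = {established}")
```

The second call illustrates that failing to establish non-inferiority does not require the test treatment to be markedly worse: with too few patients the interval may simply be too wide to exclude the margin.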

1.3 Choice of design

1.3.1 Biological variability

Measurements made on human subjects rarely give exactly the same results from one occasion to the next. Even in adults, our height varies a little during the course of the day. If one measures the blood sugar levels of an individual on one particular day and then again the following day, under exactly the same conditions, greater variation than that in height would be expected. Hence, were such an individual to be assessed and then receive an intervention (perhaps to lower blood sugar levels), any lowering recorded at the next assessment cannot necessarily be ascribed to the intervention itself. The levels of inherent variability may be very high so that, perhaps in the circumstances where a subject has an illness, the oscillations in these may disguise, at least in the early stages of treatment, the beneficial effect of the treatment given to improve the condition.
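A small simulation illustrates why a lowering observed after an intervention cannot automatically be credited to it. In the sketch below no intervention is given at all: each simulated subject has a constant underlying level, and the two measurements differ only through within-subject variability. The underlying level, the within-subject standard deviation and the normal distribution are hypothetical choices made for illustration.

```python
import random


def simulate_repeat_measures(n_subjects, true_level=6.0, within_sd=0.5, seed=1):
    """Simulate two measurements per subject with NO intervention between them.

    Returns the apparent 'lowering' (day 1 minus day 2) for each subject.
    """
    rng = random.Random(seed)
    lowering = []
    for _ in range(n_subjects):
        day_1 = rng.gauss(true_level, within_sd)
        day_2 = rng.gauss(true_level, within_sd)
        lowering.append(day_1 - day_2)
    return lowering


lowering = simulate_repeat_measures(1000)
share_lower = sum(value > 0 for value in lowering) / len(lowering)
mean_lowering = sum(lowering) / len(lowering)

print(f"subjects with a lower second measurement: {share_lower:.1%}")
print(f"mean lowering across subjects: {mean_lowering:.3f}")
```

Roughly half of the simulated subjects show a lower second value purely by chance, although the average change is close to zero. An uncontrolled assessment of a single patient, or of a few, could therefore easily mistake such chance lowering for a treatment effect.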

Example 1.13

Patient-to-patient variability – atopic eczema

The considerable between-patient variability in the trial of Example 1.7 is illustrated in Figure 1.1. In the 41 patients receiving Azathioprine, the reduction in disease activity (SASSAD) ranged from −10 to 32. There is considerable overlap of these values with those from the 20 patients receiving Placebo whose values range from −12 to 20. This figure clearly illustrates that, although there is considerable variation, the majority of patients in both groups improve. Further, the corresponding reduction in percentage body area affected with Azathioprine was reported to range from approximately −15 to 85% and for Placebo approximately −20 to 45%. Nevertheless, even with the majority of patients improving in both groups, the trial of Meggitt, Gray and Reynolds (2006) indicated a better outcome, on average, for those receiving Azathioprine.

Figure 1.1 Individual patient reductions in disease activity (SASSAD) for the Azathioprine and Placebo treatment groups with the corresponding means indicated (vertical axis: reduction in SASSAD score from baseline; horizontal axis: Placebo and Azathioprine groups). Source: Data from Meggitt, Gray and Reynolds (2006).

With such variability, it follows that, in any comparison made in a biomedical context, differences between subjects or groups of subjects frequently occur. These differences may be due to real effects, random variation or both. It is the job of the experimenter to decide how this variation should be taken note of in the design of the ensuing trial. The purpose is that, once at the analysis stage, the variation can be partitioned suitably into that due to any real effect of the interventions on the difference between groups and that from the random or chance component.

1.3.2 Randomisation

Ronald A Fisher (1890–1962) in laying the foundations of good experimental design, although in an agricultural and biological context, advocated the use of randomisation in allocating experimental treatments. Thus, for example, in agricultural trials various plots in a field are randomly assigned to the different experimental interventions. The argument for randomisation is that it will prevent systematic differences between the allocated plots receiving the different interventions, whether or not these can be identified by the investigator concerned, before the experimental treatment is applied.

Then, once the experimental treatments are applied and the outcome observed, the randomisation enables any differences between treatments to be estimated objectively and without bias. In these and many other contexts, randomisation has long been a keystone to good experimental design. The need for random allocation extends to all experimental situations including those concerned with patients as opposed to agricultural plots of land. The difficulty arises because clinical trials (more emotive than experiments) do indeed concern human beings who cannot be regarded as experimental units and so should not be allocated the interventions without their consent. The consent process clearly complicates the allocation process and, at least in the past, has been used as a reason to resist the idea of randomisation of patients to treatment. Unfortunately, the other options, perhaps a comparison of patients receiving a ‘new’ treatment with those from the past receiving the ‘old’, are flawed in the sense that any observed differences (or lack thereof) may not reflect the true situation. Thus, in the context of controlled clinical trials, Pocock (1983) concluded, many years ago and some 30 years after the first randomised trials were conducted, that: The proper use of randomization guarantees that there is no bias in the selection of patients for the different treatments and so helps considerably to reduce the risk of differences in experimental environment. Randomized allocation is not difficult to implement and enables trial conclusions to be more believable than other forms of treatment allocation.
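The argument for randomisation can be illustrated by simulation. The sketch below gives each simulated patient an unmeasured prognostic score and then compares two ways of allocating treatment: at random, or by a clinician who, perhaps subconsciously, tends to give the new treatment to patients with better prognosis. The prognostic score, the clinician's decision rule and all names are hypothetical and serve only to show the mechanism.

```python
import random
from statistics import mean


def prognostic_imbalance(n_patients, allocate, seed):
    """Mean prognostic score in the 'new' arm minus that in the 'old' arm."""
    rng = random.Random(seed)
    scores = [rng.gauss(0, 1) for _ in range(n_patients)]  # unmeasured factor
    arms = [allocate(score, rng) for score in scores]
    new_arm = [s for s, arm in zip(scores, arms) if arm == "new"]
    old_arm = [s for s, arm in zip(scores, arms) if arm == "old"]
    return mean(new_arm) - mean(old_arm)


def randomised(score, rng):
    """Allocation ignores the patient's prognosis entirely."""
    return rng.choice(("new", "old"))


def clinician_choice(score, rng):
    """Hypothetical rule: an imperfect impression of prognosis drives the choice."""
    impression = score + rng.gauss(0, 1)
    return "new" if impression > 0 else "old"


random_imbalance = prognostic_imbalance(2000, randomised, seed=7)
chosen_imbalance = prognostic_imbalance(2000, clinician_choice, seed=7)

print(f"imbalance with random allocation:    {random_imbalance:.2f}")
print(f"imbalance with clinician allocation: {chosen_imbalance:.2f}")
```

With random allocation the two arms differ in prognosis only by chance, and the difference shrinks as the trial grows. With clinician-chosen allocation a systematic difference remains however many patients are recruited, and it would be mixed up with any real treatment effect at the analysis stage.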

As a consequence, we are focussing on randomised controlled trials and not giving much attention to less scientifically rigorous options.

1.3.3 Design hierarchy

The final choice of design for a clinical trial will depend on many factors; key amongst these are clearly the specific research question posed, the practicality of recruiting patients to such a design and the resources necessary to support the trial conduct. We shall discuss these and other issues pertinent to the design choice in later chapters. Nevertheless, we can catalogue the main types of design options available and these are listed in Figure 1.2. This gives a relative weight to the evidence obtained from these different types of clinical trial. All other things being equal, the design that maximises the weight of the resulting evidence should be chosen. For expository purposes, we assume that a comparison of a new test treatment with the current standard for the specific condition in question is being made.

Type of trial, in order of evidence level:
• Double-blind randomised controlled trial (RCT) (strongest)
• Single-blind RCT
• Non-blinded (open) RCT
• Non-randomised prospective trial
• Non-randomised retrospective trial
• Before-and-after design (historical control)
• Case-series (weakest)

Figure 1.2 The relative strength of evidence obtained from alternative designs for comparative clinical trials

1.3.3.1 Randomisation

The design that provides the strongest type of evidence is the double-blind (or double-masked) randomised controlled trial (RCT). In this, the patients are allocated to treatment at random and this ensures that in the long run patients, before treatment commences, will be comparable in the test and standard groups. Clearly, if the important prognostic factors that influence outcome were known, one could match the patients in the standard and test groups in some way. However, the advantage of randomisation is that it balances for both the known and the unknown prognostic factors and this could not be achieved by matching. Thus, the reason for the attraction of the randomised trial is that it is the only design that can give an absolute certainty that there is no bias in favour of one group compared to another at the start of the trial. Indeed, in Example 1.12, Erbel, Di Mario, Bartunek, et al. (2007), who essentially conducted a single-arm prospective case study, admitted that failure to conduct a randomised comparison compromised their ability to draw definitive conclusions concerning the stent on test.

1.3.3.2 Blinding or masking

For the simple situation in which the attending clinician is also the assessor of the outcome, the trial should ideally be double-blind (or double-masked). This means that neither the patient nor the attending clinician will know the actual treatment allocated. Having no knowledge of which treatment has been taken, neither the patient nor the clinician can be influenced at the assessment stage by such knowledge. In this way, an unprejudiced evaluation of the patient response is obtained. Thus Meggitt, Gray and Reynolds (2006) used double-blind formulations of Azathioprine or Placebo so that neither the patients with moderate-to-severe eczema, nor their attending clinical team, were aware of who received which treatment. Although they did not give details, the blinding is best broken only at the analysis stage once all the data have been collated. Despite the inherent advantage of this double-blind design, most clinical trials cannot be conducted in this way as, for example, a means has to be found for delivering the treatment options in an identical way. This may be a possibility if the standard and test are available in tablet form of identical colour, shape, texture, smell and taste.
If such ‘identity’ cannot be achieved, then a single-blind design may ensue. In such a design, one of the patient or the clinical assessor has knowledge of the treatment being given but the other does not. In trials with patient survival time as the endpoint, double-blind usually means that both the patient and the treating physician and other staff are blinded. However, assessment is objective (death) and the blinding irrelevant by this stage. Finally, and this is possibly the majority situation, there will be circumstances in which neither the patient nor the assessor can be blind to the treatments actually received. Such designs are referred to as ‘open’ or ‘open-label’ trials.

1.3.3.3 Non-randomised designs

In certain circumstances, when a new treatment has been proposed for evaluation, all patients are recruited prospectively but allocation to treatment is not made at random. In such cases, the comparisons may well be biased and hence are unreliable. The bias arises because the clinical team choose which patients receive which intervention and in so doing may favour (even subconsciously) giving one treatment to certain patient types and not to others. In addition, the requirement that all patients should be suitable for all options may not be fulfilled – in that if it is known that a certain option is to be given to a particular subject then one may not so rigorously check if the other options are equally appropriate. Similar problems arise if investigators have recruited patients into a single-arm study, and the results from these patients are then compared with information on similar patients having (usually in the past) received a relevant standard therapy for the condition in question. However, such historical comparisons are likely to be biased also and to an unknown extent so again it will not be reasonable to ascribe the difference (if any) observed entirely to the treatments themselves.
Of course, in either case, there will be situations when one of these designs is the only option available. In such cases, a detailed justification for not using the ‘gold standard’ of the randomised controlled trial is required. Understandably, in this era of EBM, information from non-randomised comparative studies is categorised as providing weaker evidence than that from randomised trials. The before-and-after design is one in which, for example, patients are treated with the Standard option for a specified period and then, at some fixed point in time, subsequent patients receive the Test treatment. This is the type of design used by Erbel, Di Mario, Bartunek, et al. (2007) to evaluate a bioabsorbable stent for coronary scaffolding. In such examples, the information for the Standard is retrospective in nature and is often obtained from clinical records only and so was not initially collected for trial purposes. If this is the case, the before-and-after design is likely to be further compromised as, for example, in the ‘before’ period, the patient selection criterion, clinical assessments and data recorded may not meet the standards required of the ‘after’ component. Such differences are likely to influence the before-and-after comparison in unforeseen and unknown ways.

Example 1.14

Glioblastoma in the elderly – non-randomised design

Brandes, Vastola, Basso, et al. (2003) describe a study comparing radiotherapy alone (Group A), radiotherapy and the combination of procarbazine, lomustine and vincristine (Group B) and radiotherapy with temozolomide (Group C) in 79 elderly patients with glioblastoma. The authors state: The first group (Group A) was enrolled in the period from March 1993 to August 1995 … . The second group (Group B) was enrolled from September 1995 to September 1997 … . The third group (Group C) was enrolled from September 1997 to August 2000 and … .

The authors conclude: Overall survival was better in Group C compared with Group A (14.9 months v 11.2 months, P = 0.002), but there was no statistical differences found between Groups A and B or between Groups B and C.

However, since patients have not been randomised to groups, one cannot be sure that the differences (and lack of differences) truly reflect the relative efficacy of the three treatments concerned. This type of design should be avoided if at all possible.

1.3.3.4 Case series

A case series is a study in which an investigator reports on the outcome of a series of patients treated with a particular approach. This may be the only ‘design’ option available in rare or unusual circumstances but is unlikely to provide clear evidence of efficacy. There are many criticisms of this design. Generally, one may not know how the patients have been selected; the clinical team may have an eye for selecting those patients to be given the treatment who are likely to recover in any event; without further evidence of the natural history of the disease, we do not know whether the patients may have recovered spontaneously, without intervention; and we do not know whether the investigator’s approach to treatment is better than any alternatives.

1.4 Practical constraints

Control of the ‘experiment’ is clearly a desirable feature – perhaps easy to attain in the physics laboratory where experimental conditions are tightly controlled but not so easy with living material particularly if they are human. A good trial should answer the questions posed as efficiently as possible. In broad terms, this implies recruiting as few subjects as is reasonably possible for a reliable answer to be obtained.

Although good science may lead to an optimal choice of design, the exigencies of real life may cause these ideals to be modified. All the same, one can still keep in mind the hierarchy in the choice of designs of Figure 1.2, but where to enter this hierarchy will depend on circumstance. Thus, the investigators do not aim for the best design, but only the best realisable design in their context.

Technical (statistical) aspects of experimental design can be used in a whole variety of settings. Nevertheless, there are specific problems associated with implementing these designs in practice in the field of clinical trials. It is clear that trials cannot be conducted without human subjects (often patients); nevertheless, the constraints this imposes are not inconsiderable. Figure 1.3 illustrates some aspects that need to be considered when conducting such trials.

As we have indicated, the requirements for human studies are usually more stringent than in other research areas. For example, safety, in terms of the welfare of the experimental units involved, is of overriding concern in clinical trials, possibly of less relevance in animal studies and of no relevance to laboratory studies. In some sense, the laboratory provides, at least in theory, the greatest rigour in terms of the experimental design, and studies in human subjects should be designed (whenever possible) to be as close to these standards as possible. However, in the laboratory no consent procedures are required from the experimental units, nor from animals if they are involved, whereas consent is a very important consideration in all human experimentation even in a clinical trial with therapeutic intent.

Constraints may also apply to the choice of interventions to compare. For example, in certain therapeutic trials there may be little chance that a placebo option will bring any benefit (although this is certainly not the case in all circumstances) so comparisons may have to be made between two allegedly ‘active’ approaches despite little direct evidence that either of them will bring benefit. However, if, at the end of such a trial, a difference between treatments is demonstrated then activity for the better of the two is established so in one sense comparison with a placebo was not necessary. In contrast, should the two treatments appear not to differ in their effectiveness then no conclusions can be drawn since one does not know whether both are equally beneficial or whether both are equally ineffective as compared to Placebo. Thus, an investigating team conducting this type of trial needs to be fully aware of the potential difficulties.

Ethical considerations, as judged perhaps by a local, national or international committee, may also prevent the ‘optimal’ design being implemented. There are also issues related to patient data confidentiality which may, in the circumstances of a multicentre trial, make synthesis of all the trial data problematical. We address other components of Figure 1.3 in later sections of the book.

Design features and the associated considerations:
• Method of assessments: If invasive – may not be acceptable.
• Treatment or Intervention: Implicit that treatment should do some good – thus an innocuous or placebo treatment may not be acceptable.
• Subject safety issues: Overriding principle is the safety of the subjects
• Protocol Review: Scientific and ethical
• Consent: Fully informed consent mandatory
• Recruitment: Usually, subjects recruited one-by-one over calendar time
• Time scale: May be relatively long – rarely weeks, seldom months, quite often years
• Trial size: Not too large or too small
• Patient losses: Subjects may refuse to continue in the trial at any stage
• Observations: Usually, subjects assessed one-by-one over calendar time
• Design changes: Almost certainly require new ethical approval
• Data protection: Confidentiality and often National Guidelines for storage and transfer.
• Reporting: CONSORT for Phase III trials (Moher, Hopewell, Schultz, et al, 2010)

Figure 1.3 Special considerations for clinical trials in human subjects

1.5 Influencing clinical practice

As we have indicated, an important consideration at the design stage of a trial is to consider whether, if the new treatment proves effective, the trial will be reliable enough in itself to convince clinical teams not associated with the trial of the findings. Importantly, if a benefit is established will this be quickly adopted into national clinical practice? Experience has suggested that all too frequently trials have less impact than they deserve although it is recognised that results that are adopted in practice are likely to be from trials of an appropriate size, conducted by a respected group and have multicentre involvement. Thus, there are considerations, in some sense outside the strict confines of the design, which investigators should heed if their findings are to have the desired impact.

Some basic or administrative things can help reassure the eventual readers of the reliability of the trial results. These include, although some of these may be mandatory, registering the trial itself, involving and informing other clinical colleagues outside the trial team of progress, careful documentation of any serious adverse events, ensuring the trial documentation is complete, establishing procedures for responding to external queries, clarity of the final reporting document in the research literature and seeking avenues for wider dissemination of the trial results.

1.6 History

Probably the single most important contribution to the science of comparative clinical trials was the recognition by Austin Bradford Hill (1897–1991) in the 1940s that patients should be allocated the options under consideration at random so that comparisons should be free from bias. Consequently, the first randomised trial was planned to test the value of a pertussis vaccine to prevent whooping-cough, the results of which were subsequently published by the Medical Research Council Whooping-Cough Immunization Committee (1951). He later stated: ‘The aim of the controlled clinical trial is very simple: it is to ensure that the comparisons we make are as precise, as informative, and as convincing as possible’. This development by itself may not have led to more theoretically based statistical innovation directly, but was the foundation for the science of clinical trials.

Nevertheless, the history of clinical trials research precedes this important development by many years. Thus, clinical trials were mentioned by Avicenna (980–1037) in his The Canon of Medicine (1025) in which he laid down rules for the experimental use and testing of drugs and wrote a precise guide for practical experimentation in the process of discovering and proving the effectiveness of medical drugs and substances. His rules and principles for testing the effectiveness of new drugs and medications are summarised in Figure 1.4, and these still form the basis of modern clinical trials.

One of the most famous clinical trials was that conducted by James Lind (1716–1794) in 1747. He compared the effects of various different acidic substances, ranging from vinegar to cider, on groups of sailors afflicted by scurvy and found that the group who were given oranges and lemons had largely recovered from their scurvy after 6 days. Somewhat later, Frederick Akbar Mahomed (1849–1884) founded the Collective Investigation Record for the British Medical Association. This organisation collated data from physicians practising outside the hospital setting and was an important precursor of modern collaborative clinical trials.

The conduct of clinical trials research is multidisciplinary in nature so that a team effort is always needed from the concept stage, through design, conduct, monitoring and reporting. This collaborative effort has led not only to medical developments in many areas but to developments of a more statistical nature.
Thus, for those working in cancer and for whom survival was a key endpoint in the clinical trials, the two seminal papers published by Peto, Pike, Armitage, et al. (1976, 1977) in the British Journal of Cancer marked a new era. These papers provided the template for key items essential to the design, conduct, analysis and reporting of randomised trials with emphasis on those requiring prolonged observation of each patient. In particular, these papers described the Kaplan and Meier (1958) estimate of the survival curve, the logrank test and the stratified logrank test in such detail that any careful investigator could follow the necessary steps. A computer program (termed the Oxford program) had also been distributed (some time before the date of the publications themselves), and this allowed the methods suggested by the papers to be implemented. Certainly, for those working in data centres with responsibility for many (often reasonably large) trials, this program facilitated the analysis and helped to ensure that the ideas expressed in these articles were widely disseminated. These papers formed the basic text for those involved in clinical trials for many years and (besides making the ideas accessible to medical statisticians) their role in easing the acceptance of statistical ideas into the clinical community cannot be overestimated.

It should not go unnoticed that David Roxbee Cox was one of the authors of the seminal papers referred to above although his paper describing the proportional-hazards regression model appeared some 4 years earlier (Cox, 1972). His paper was presented at a discussion meeting of the UK Royal Statistical Society and subsequently published in Series B of the Society’s journals. This journal deals with the more theoretical aspects of statistical research and does not make easy reading for many statisticians and would not be one to which clinical teams might readily refer. Despite this, this particular paper is probably one of the most cited papers in the medical literature. In brief, the methodology leads to easier analysis of trials with survival time endpoints that include stratification in their design and/or baseline patient characteristics at the time of randomisation which may affect prognosis.

(1) The drug must be free from any extraneous accidental quality.
(2) It must be used on a simple, not a composite, disease.
(3) The drug must be tested with two contrary types of diseases, because sometimes a drug cures one disease by its essential qualities and another by its accidental ones.
(4) The quality of the drug must correspond to the strength of the disease. For example, there are some drugs whose heat is less than the coldness of certain diseases, so that they would have no effect on them.
(5) The time of action must be observed, so that essence and accident are not confused.
(6) The effect of the drug must be seen to occur constantly or in many cases, for if this did not happen, it was an accidental effect.
(7) The experimentation must be done with the human body, for testing a drug on a lion or a horse might not prove anything about its effect on man.

Figure 1.4 Avicenna’s rules for the experimental use and testing of drugs

As we have indicated, EBM requires critical assessment of all the available evidence about whether an intervention works. Thus, systematic overviews have become a vital component of clinical trial research and are routinely applied before launching new trials as a means of confirming the need to carry out a clinical trial or after completing trials as a means of synthesising and summarising the current knowledge on the topic of interest. These reviews are the focal interest of the Cochrane Collaboration, and the associated handbook by Higgins, Thomas, Chandler, et al. (2019) provides the key to their implementation.

Some developments have not depended on technical advancement (although there are always some) such as the now standard practice of reporting confidence intervals rather than relying solely on p-values at the interpretation stage. Of major importance over this same time period has been the expansion in data processing capabilities and the range of analytical possibilities only made feasible by the amazing development in computer power. Despite many advances, the majority of randomised controlled trials remain simple in design – most often a two-group comparison.

1.7 How do trials arise?

Although the focus of this book is on comparative, or Phase III, trials to establish the relative efficacy of the interventions under test, it should be recognised that these may be preceded by an often extensive research programme starting with the laboratory bench,


[Figure 1.5 shows five sequential phases along a continuum of increasing evidence:

Theory (Preclinical): Explore relevant theory to ensure best choice of intervention and hypothesis and to predict major confounders and strategic design issues.

Modelling (Phase I): Identify the components of the intervention and the underlying mechanisms by which they will influence outcomes to provide evidence that you can predict how they relate to and interact with each other.

Exploratory trial (Phase II): Describe the constant and variable components of a replicable intervention and a feasible protocol for comparing the intervention with an appropriate alternative.

Definitive randomised controlled trial (Phase III): Compare a fully defined intervention with an appropriate alternative using a protocol that is theoretically defensible, reproducible, and adequately controlled in a study with appropriate statistical power.

Long-term implementation (Phase IV): Determine whether others can reliably replicate your intervention and results in uncontrolled settings over the long-term.]

Figure 1.5 Sequential phases of developing randomised controlled trials of complex interventions. Source: Campbell, Fitzpatrick, Haines, et al., (2000).

moving to animal studies and then to early and later stage studies in man. Also once the Phase III stage itself is complete, there may be further studies initiated. Figure 1.5, taken from Campbell, Fitzpatrick, Haines, et al. (2000), succinctly summarises the pathway of the whole trial process. The steps range from studies to determine the pharmacokinetic profile of a drug in healthy volunteers (Preclinical) to establishing the appropriate dosage for use in man (Phase I), then the establishment of indications of activity (Phase II). However, some of these steps may be taken in parallel and even simultaneously in the same subjects. These early studies are not usually randomised. However, studies conducted by Krishna, Anderson, Bergman, et al. (2007) on the effect of the cholesteryl ester transfer protein inhibitor, anacetrapib, on lipoproteins in patients with dyslipidaemia are described by them as ‘randomized’ and ‘phase I’. Randomised they undoubtedly are but their use of the Phase I nomenclature does not have an exact parallel in Figure 1.5. This highlights a difficulty when attempting to categorise trials using such a simple system. One may imagine that there will be clear stages in the development of a bioabsorbable coronary stent. These too will not exactly parallel those of drug development although they may well involve laboratory and animal studies. The single-arm trial of Erbel, Di Mario, Bartunek, et al. (2007) may be considered as close to the Phase II type or a feasibility study of Chapter 19. There are also parallels (although modifications will be necessary) for new approaches to, for example, surgical, radiotherapy or physiotherapy techniques, and combinations of different procedures. They also extend beyond merely therapeutic


trials to planning, for example, trials comparing alternative forms of contraception in women and those evaluating alternative health promotion interventions. However, in some instances, such as in trials comparing educational packages, they may start at the full Phase III stage without involving the earlier phases. Alternatively, comparative trials may evolve from questions arising in clinical practice and not from a specific development process. Thus, one may wish to compare different surgical timings, at 6 months or at 1 year of age, for reconstructive surgery in infants with cleft palate as is proposed in the trial conducted by Yeow, Young, Chen, et al. (2019). Whatever the pathway, the eventual randomised comparative trial to be conducted is clearly a major event as only when this has been conducted will there be reliable (although not necessarily convincing) evidence of the efficacy of the intervention concerned. In certain situations, often for regulatory purposes, a Phase III trial may be followed by a confirmatory trial asking essentially the same question. In addition, following regulatory approval of a product, so-called Phase IV or postmarketing trials may be initiated with the aim to gain broader experience with using the new product.

1.8 Ethical considerations

For a trial to be ethical, at the time it is designed the ethical review committees will want to be convinced that there is collective uncertainty amongst clinicians as to which treatment is superior or more appropriate for the patients. They will also need to be persuaded that the sample size and other aspects of the study design are such that the trial is likely to provide information sufficient to reduce this uncertainty and thus influence subsequent medical practice if one treatment or the other appears superior. A clinical trial cannot go forward until the protocol has been through the appropriate ethical review processes, the exact nature of which varies from country to country. These should always include a very thorough review of the scientific aims as well as the more ‘subject’ oriented concerns to protect those who will be recruited to the trial. In brief terms, this implies that if a trial is not scientifically sound – then it should not be judged as ethically acceptable.

1.9 Regulatory requirements

In addition to the more overtly scientific parts of the clinical trials process, there are many regulatory requirements to which national and international law obliges a trial team to adhere. For example, the regulations insist that informed consent is obtained from patients entering trials and that the confidentiality of personal data is preserved. These regulations are generally referred to as requirements for Good Clinical Practice (GCP) as described in ICH E6 (R2) (2016). We will refer to specific aspects of GCP as they arise in the text, but readers are cautioned that the specifics are continually being updated. Principles to guide statisticians working on clinical trials have been laid down by ICH E9 (R1) (2018).


If the trial is seeking regulatory approval of (say) a new drug, then all the associated requirements for approval should be reviewed by the trial team before, during and after the trial protocol is being developed to ensure that all aspects are covered so as to avoid the rejection of the application on what might be a technical detail. For example, there may be a regulatory requirement for some additional animal studies to be conducted before approval can be granted. These requirements are summarised in documents such as those provided by the European Medicines Agency (EMEA) and the US Food and Drug Administration (FDA). In some circumstances, it is a requirement for regulatory approval that a confirmatory trial is conducted. Such a trial is essentially a repeat of an initial one, perhaps in a different or wider patient group or with a wider group of clinical teams involved, but it must follow the essential features of the predecessor design. Clearly, these details should be cross-checked with the relevant authorities before the protocol is finalised and patients are recruited.

1.10 Focus

As we have illustrated, the size of clinical trials can range from the relatively few to as many as several thousands of subjects being recruited. Consequently, and leaving specific details aside, these will require a range of resources from the relatively modest to the very considerable. It must be emphasised that the size of a clinical trial is determined by the question(s) that are posed, and the resources allocated should reflect the importance of that question. Clearly, a very experienced team is required to launch a large trial but even the design team of an ultimately small sized trial will need access to appropriate personnel including, at a minimum, those with clinical, statistical, data management and organisational skills and often other specialist skills from, for example, pharmacy, pathology and many other specialties. It is important that the design team do not underestimate the scale of the task. The focus of this book is on the design of (randomised) comparative (usually termed Phase III) trials which are likely to be of relatively modest size. We aim to give clear guidance as to how these may be designed, conducted, (to some extent) analysed and reported. However, it is also important that investigators contributing patients to clinical trials who are perhaps not part of the design team also understand the issues concerned as the very success of the trials depends crucially on their collaboration and understanding of the processes involved.

1.11 Further reading

Although Day (2007) provides a comprehensive list of books about clinical trials, the following are particularly useful:

Day S (2007). Dictionary for Clinical Trials. (2nd edn). Wiley, Chichester.
Fitzpatrick S (2008a). Clinical Trial Design. ICR Publishing, Marlow.
Fitzpatrick S (2008b). The Clinical Trial Protocol. ICR Publishing, Marlow.
Girling DJ, Parmar MKB, Stenning SP, Stephens RJ and Stewart LA (2003). Clinical Trials in Cancer: Principles and Practice. Oxford University Press, Oxford.
Machin D, Campbell MJ, Tan SB and Tan SH (2018). Sample Size Tables for Clinical, Laboratory and Epidemiology Studies. (4th edn). Wiley-Blackwell, Chichester.
Redwood C and Colton T (eds) (2001). Biostatistics in Clinical Trials. Wiley, Chichester.
Wang D and Bakhai A (eds) (2006). Clinical Trials: A Practical Guide to Design, Analysis and Reporting. Remedica, London.

Hints on how to display medical data in tabular and graphical form are given by:

Freeman JV, Walters SJ and Campbell MJ (2008). How to Display Data. BMJ Books, Oxford.

For those specifically interested in health-related quality of life issues:

Fayers PM and Machin D (2016). Quality of Life: The Assessment, Analysis and Interpretation of Patient-reported Outcomes. (3rd edn). Wiley-Blackwell, Chichester.

For those specifically interested in cluster randomised trials at a more technical level:

Campbell MJ and Walters SJ (2014). How to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. Wiley, Chichester.

For those requiring a wide view of how randomised trials have impacted on clinical practice over a wide range of diseases and conditions:

Machin D, Day S and Green S (eds) (2006). Textbook of Clinical Trials. (2nd edn). Wiley, Chichester.

CHAPTER 2

Design Features

This chapter gives an overview of the general structure of a randomised clinical trial. The key components are highlighted. These include the type of patients or subjects that are likely to be relevant to the objectives of the trial and then, within this group, those who are specifically eligible for the trial in mind, the research question(s) and the choice of design. We emphasise the requirement of fully informed consent before a patient is entered into a trial, the determination as to whether or not the interventions on offer are equally appropriate for the individual concerned, the method of allocation to the alternative interventions and the subsequent patient assessments required to determine the relevant trial endpoint(s). Also, aspects associated with analysis, reporting and interpretation of the results are included. Finally, we introduce the basic ideas of a statistical model upon which the ultimate analysis of the clinical trial is based.

2.1 Introduction

Although in this chapter we will focus on one particular design, there are many features of the clinical trials process which are common to the majority of situations. Thus, we use the example of a parallel two-group individually randomised trial to overview some pertinent issues, ranging from defining carefully the research question posed, and thereby the type of subjects to recruit, the interventions used, the allocation of the trial participants to these interventions, endpoint assessment, analysis and reporting. However, in later chapters we will expand on detail and also on other design options. This basic design will often compare a Test therapy or intervention with a Standard (or control) therapy. Frequently, the patients will be assigned at random to the options on a 1 : 1 basis. In reality, the actual choice of design will be a key issue at the planning stage and it should not be assumed that the common design used here for illustration best suits all purposes.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.


Example 2.1

Two-group parallel design – symptomatic oral lichen planus

Poon, Tin, Kim, et al. (2006) describe a randomised trial of the comparative effect of the then current Standard, topical steroid (S), and the Test, topical cyclosporine (C), in patients with symptomatic oral lichen planus. The basic structure of their trial is given in Figure 2.1.

This trial typifies the design and structure of the randomised parallel-group comparative trial which is used extensively. The key features include the following: defining the general types of subjects to be studied, identifying the particular subjects eligible for the trial and obtaining their consent, randomly allocating the Standard and Test interventions to individuals and, once the intervention is introduced, making the appropriate assessments in order to determine outcome. The analysis of the data recorded for all patients will form the basis of the comparison between the intervention groups and (hopefully) provide a clear indication of their relative clinical importance and supply the framework for the subsequent clinical report describing the results. In the following sections, we will follow the sequence of Figure 2.1 even though this will not be entirely reflected in real-life situations of the planning stages where a back-and-forth process is more likely. In general, there will also be additional and trial-specific steps that one must also take.

[Figure 2.1 flow diagram: Patients with oral lichen planus → Eligible and consenting patients → Random allocation to treatment → Steroid (S) (Standard) or Cyclosporine (C) (Test) → Assessment → Analysis and reporting]

Figure 2.1 Individually randomised controlled trial of the effect of topical steroid (S) and topical cyclosporine (C) in patients with symptomatic oral lichen planus. Source: Based on Poon, Goh, Kim, et al., (2006).

2.2 The research question

Of fundamental importance before embarking on a clinical trial is to identify the research question(s) of interest. Such questions may range from a very scientifically orientated objective to one focussed on a very practical day-to-day clinical situation. For instance, the trial of Example 1.11 aims to reduce the number of hip fractures in elderly residents of nursing homes, whereas one objective of those reported in Example 1.1 is concerned with gastric emptying times and is similar to a non-clinical laboratory-based investigation. Two key issues are as follows: Is the question worth answering? Is the answer already known? Clearly, the answer to the first should be unequivocally a ‘Yes’. For the second, one might expect a ‘No’, although there are circumstances when an earlier result may need confirmation. For example, Wee, Tan, Tai, et al. (2005) conducted a confirmatory trial of one previously conducted by Al-Sarraf, Pajak, Cooper, et al. (1990). The rationale for the repeat trial was based on the former trial involving mainly Caucasian patients, whilst the repeat was to involve those of predominantly Chinese ethnicity and to be conducted by different investigators in another part of the world. In the event, the advantage to chemo-radiation as compared to radiotherapy alone in terms of overall survival of patients with advanced nasopharyngeal cancer was confirmed thereby indicating wider generalisation of the results from the two trials. The question(s) posed must have important consequences in that the answer should inform research and/or influence clinical practice in a meaningful way. Further, before the trial is conducted, there should be a reasonable expectation that the trial question will be answered; otherwise, for example, patients may be subjected to unnecessary investigation and possibly discomfort without justification. 
One exception to this condition may be when considering information from necessarily small randomised trials in truly rare diseases or conditions where patient numbers will be insufficient for the usual rules for trial size determination to be applied. We return to this latter issue in Chapter 20. Erbel, Di Mario, Bartunek, et al. (2007), see Example 1.12, state in the summary of their trial that: Coronary stents improve immediate and late results of balloon angioplasty by tacking dissections and preventing wall recoil. These goals are achieved within weeks after angioplasty, but with current technology stents permanently remain in the artery, with many limitations including the need for long-term antiplatelet treatment to avoid thrombosis.

This provides a clear rationale for testing a stent for coronary scaffolding made from a bioabsorbable material which should provide an effective scaffold but would not be permanently retained in the artery. Meyer, Warnke, Bender and Mülhauser (2003), Example 1.11, point out that hip fractures in the elderly are a major cause of disability and functional impairment so that reducing their incidence by encouraging the use of hip protectors within nursing homes may help to reduce this morbidity. The use of homoeopathic remedies is widespread although there have been few randomised trials to establish their relative efficacy against conventional methods


(including placebo). Consequently, the trial of Example 1.4 by Stevinson, Devaraj, Fountain-Barber, et al. (2003) compares daily homoeopathic arnica against placebo in patients with carpal tunnel syndrome having elective surgery, as it has been claimed that: Homeopathic arnica is widely believed to control bruising, reduce swelling and promote recovery after local trauma: ….

At least to the investigators and also the corresponding journal editors and peer reviewers, these trials address important questions. The results of the trials suggest that the bioabsorbable magnesium stent is a useful development, encouraging the use of hip protectors reduces hip fracture rates, whereas homoeopathic arnica appears to be no better than placebo.

2.3 Patient selection

Common to all clinical trials is the necessity to define precisely which types of subjects are eligible for recruitment purposes. This implies that even if healthy volunteers are to be the participants, then a definition of ‘healthy’ is required. This definition may be relatively brief or very complex depending on the situation. At the early stages of the design process, one may have only a general idea of the types of subjects required although identifying the research question may have already made this reasonably clear. Thus, elderly patients may be the first target group for whom preventative action to reduce hip fractures may be considered. Then, when considering the trial question in detail, it might be decided to confine the elderly patient group to those resident in a nursing home. Further refinement may then, for example, define the elderly for trial purposes as those over 80 years of age and exclude those nursing homes that deal with psychiatric residents only. Considerations here might have been an anticipated very low fracture rate in those under 80 years of age and the difficulties associated with obtaining fully informed consent from patients with psychiatric illnesses. These selection criteria are easy to understand, easy to determine and therefore easy to apply in practice. Patients in the trial of Example 1.7, conducted by Meggitt, Gray and Reynolds (2006) in atopic eczema, had to satisfy an extensive list of criteria before they could be considered eligible. The general requirements specified patients 16–65 years of age with atopic eczema. However, only those with moderate-to-severe disease were to be included and this had to be determined according to the UK modification of the Hanifin and Rajka diagnostic criteria suggested by Williams, Burney, Hay, et al. (1994) which involves a detailed examination of the patient. 
Further, the trial excluded those: admitted to hospital for eczema; had used phototherapy or sunbeds; had been treated with cyclosporin, systemic steroids, Chinese herbal medicine, topical tacrolimus or evening primrose oil during the preceding 3 months; unstable or infected eczema in the previous 2 weeks requiring either highly potent topical steroids or oral antibiotics; TMPT activity 22 stones/140 Kg (upper weight limit for overlay mattress) a

Table 1 referred to here is similar to that of Figure 4.2 in the next chapter.

In addition to the Exclusion Criteria, in the PRESSURE (2000) protocol, there were rather complex Inclusion Criteria as three general types of patients were to be included. These patients were those having an admission to one of the designated hospital wards, and who were also one of the following three types: (i) Acute, (ii) Elective with existing


pressure sore or with reduced mobility, and (iii) Elective with neither a pressure sore nor with reduced mobility. In the past, protocols tended to be restrictive about the patients admitted to trials and would focus on good-prognosis patients. Modern trials increasingly adopt the perspective that patients are eligible provided the clinician regards all the options for treatment within the protocol as potentially suitable for the patient under consideration, and provided the clinician acknowledges that it is objectively unclear as to which option is preferable. Thus, fewer eligibility criteria now specify an upper age limit, but are more likely to emphasise that the patient should be fit enough to tolerate side effects and toxicity.

3.8 Randomisation

There are two aspects of the randomisation process that need to be described in the protocol. One is the structural features of the design while the other is the more procedural aspects of how patients will be randomised and hence allocated to a specific intervention. However, as we will discuss in Chapter 5, there is a difficulty in that if full details of a block-based or other restricted randomisation scheme are made explicit in the protocol itself then the objectivity of the randomisation process may be compromised. So, we advise that such details are held in a confidential memo by the statistical team which is securely stored in the trial office files and, if need be, shared with the approving authorities but not with the clinical members of the protocol team until trial recruitment is complete.

3.8.1 Design

Although Protocol AHCC01 (1997) referred in the statistical methods to a 2 : 1 : 2 randomisation between placebo (P), tamoxifen 60 mg/day (TMX60) and TMX120, no reference to ‘randomisation in blocks’ was made. Nevertheless, the eventual trial publication of Chow, Tai, Tan, et al. (2002) stated:

Randomization was performed in balanced blocks of 5, stratified by center, and corresponding to P, TMX60, and TMX120 in the respective ratios of 2 : 1 : 2.
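To make the idea of balanced blocks of 5 in the ratio 2 : 1 : 2 concrete, the following is a minimal sketch in Python and not the procedure actually used for AHCC01. The function name, the centre names and the use of small integer seeds are all hypothetical choices made for illustration only.

```python
import random

# One block of 5 allocations in the ratio 2 : 1 : 2 (P : TMX60 : TMX120)
BLOCK = ["P", "P", "TMX60", "TMX120", "TMX120"]


def blocked_list(n_blocks, rng):
    """Return an allocation list built from independently shuffled blocks."""
    allocations = []
    for _ in range(n_blocks):
        block = BLOCK[:]        # copy so the template is not altered
        rng.shuffle(block)      # random order within the block
        allocations.extend(block)
    return allocations


# Stratification by centre: a separate list for each centre.
# Centre names and seeds are hypothetical.
lists = {centre: blocked_list(4, random.Random(seed))
         for seed, centre in enumerate(["Centre 1", "Centre 2"])}
```

Within each centre, every consecutive set of five patients then contains two P, one TMX60 and two TMX120 allocations, which is also why knowledge of the block size can make the final allocations in a block predictable.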

In contrast to randomised blocks, the PRESSURE (2000) trial used a minimisation method (see Chapter 5) with the four factors recruiting centre, skin condition, clinical specialty of the admitting hospital ward, and whether an acute or elective admission.


Just as we have suggested that it is desirable not to reveal the block size, the precise amount of randomness set in a minimisation method should not be revealed within the protocol itself, but should be documented in a separate memo by the statistical team.

Example 3.14

Protocol PRESSURE (2000) Section 7.2: Pressure-relieving support surfaces: a randomised evaluation

7.2 Mattress Allocation Method

Allocation to treatment will be by minimisation and with respect to the factors listed in Table 4.

Table 4 Minimisation factors

Factor            Number of levels   Levels
Centre            ×8                 As Section 6 (a)
Skin condition    ×2                 No pressure sore; Existing pressure sore
Ward specialty    ×3                 Vascular; Orthopaedic; Elderly care
Admission type    ×2                 Acute; Elective

(a) Section 6 of the protocol named the eight participating centres.
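The general logic of allocating by minimisation over factors such as those of Table 4 can be sketched as follows. This is an illustrative outline only, not the PRESSURE algorithm: the arm labels, the simple sum-of-counts imbalance score and, in particular, the probability p_best of following the minimising arm are assumptions introduced here, since the amount of randomness is precisely the detail that should not appear in a protocol.

```python
import random
from collections import defaultdict

ARMS = ("A", "B")  # hypothetical labels for the two mattress types


def new_counts():
    # counts[(factor, level)][arm] -> number of patients already allocated
    return defaultdict(lambda: dict.fromkeys(ARMS, 0))


def minimise(patient, counts, rng, p_best=0.8):
    """Allocate one patient; p_best is a hypothetical random element."""
    # Score each arm by the patients already in it who share this
    # patient's level of each minimisation factor.
    totals = {arm: sum(counts[(f, lv)][arm] for f, lv in patient.items())
              for arm in ARMS}
    lowest = min(totals.values())
    best = [arm for arm in ARMS if totals[arm] == lowest]
    if len(best) > 1:
        arm = rng.choice(best)             # tie: simple randomisation
    elif rng.random() < p_best:
        arm = best[0]                      # follow the minimising arm
    else:
        arm = rng.choice([a for a in ARMS if a != best[0]])
    for f, lv in patient.items():          # update the running totals
        counts[(f, lv)][arm] += 1
    return arm


counts = new_counts()
patient = {"centre": 1, "skin": "No pressure sore",
           "ward": "Vascular", "admission": "Acute"}
arm = minimise(patient, counts, random.Random(2000))
```

Setting p_best to 1 would make the allocation deterministic whenever the arms are not tied, which illustrates why some random element is usually retained.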

3.8.2 Implementation

Whatever the design features of the randomisation process, the protocol also has to address the method by which this is put into operation. This may range from a relatively unsophisticated telephone call or interchange of fax messages, to a web or even a voice-based response system. Thus, the SQOLP01 (1999) protocol used a telephone-based randomisation system, the details of which we give below, while the more recent EXPEL (2016) trial described by Kim, Chen, Tay, et al. (2017) opts for a web-based system.


Example 3.15

Protocol SQOLP01 (1999, Section 7): Comparison of steroid with cyclosporine for the topical treatment of oral lichen planus

7. Randomisation After the potential OLP case has been confirmed according to the eligibility criteria and informed consent has been obtained, the patient should be randomised. To randomise a patient telephone NMRC CLINICAL TRIALS & EPIDEMIOLOGY RESEARCH UNIT Tel: (65) 220-1292, Fax: (65) 220-1485 Monday–Friday: 0830 to 1730, Saturday: 0830 to 1230

Some brief details will be collected for identification purposes and the caller will be informed of the result of randomisation and at the same time the patient will be assigned a protocol Trial Number.

Example 3.16

Protocol EXPEL (2016): Peritoneal lavage after curative gastrectomy

Enrolment and randomisation

Having obtained informed written consent, potential subjects will then undergo surgery. If the patients are confirmed to have cT3 or cT4 disease and are amenable to radical gastrectomy with curative intent, they will be formally enrolled into the study and undergo randomisation. Study participants will be randomised to EIPL arm or standard arm based on random permuted blocks with varying block size of four and six, assuming equal allocation between treatment arms. The randomisation is stratified according to individual study site. A web-based randomisation programme (https://rand.scri.edu.sg/) will be utilised to facilitate randomisation. The treatment code will be disclosed to the surgeon only at the end of gastrectomy. EIPL or standard lavage will then be performed according to the treatment code. The abdomen will then be closed as per standard.
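Random permuted blocks of varying size, stratified by site, might be generated along the following lines. This is a sketch under stated assumptions and does not describe the EXPEL web-based system: the function name, the site names, the seeds and the rule of choosing each block size with equal probability are all hypothetical.

```python
import random


def site_allocation_list(n_patients, rng, block_sizes=(4, 6),
                         arms=("EIPL", "Standard")):
    """Allocation list for one site from permuted blocks of varying size."""
    allocations = []
    while len(allocations) < n_patients:
        size = rng.choice(block_sizes)             # next block size, chosen at random
        block = list(arms) * (size // len(arms))   # equal numbers of each arm
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_patients]                # the final block may be incomplete


# Stratification by site: an independent list for each site.
# Site names and seeds are hypothetical.
site_lists = {site: site_allocation_list(30, random.Random(seed))
              for seed, site in enumerate(["Site A", "Site B", "Site C"])}
```

Because each block is balanced, the difference in numbers between the two arms within a site can never exceed three at any point, yet varying the block size makes the next allocation harder to anticipate than with a single fixed size.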

The method of obtaining the randomisation that is chosen will depend on circumstances but the trend is now towards more automated systems. However, this trend does not preclude simpler yet reliable approaches that are likely to be more viable for trials of modest size. The SQCP01 (2006) protocol for management of clefts of the secondary palate in infants provided for both a telephone- and a web-based randomisation option. The use of sealed envelopes by the clinical teams, as opposed to contacting a trials office remotely, is not regarded as an optimal method of allocation and should be avoided if at all possible. Whenever they are employed, a clear justification is required. In the case of the investigators concerned with SQGL02 (1999) the nature of AACG, with its sudden onset and devastating consequences, provides the rationale.


Example 3.17


Protocol SQGL02 (1999): Brimonidine as a neuroprotective agent in acute angle-closure glaucoma (AACG)

Procedure for randomisation Due to the acute condition of AACG, sealed envelopes will be used for randomising the patients who will be more likely to be presented to the clinician after office hours.

3.9 Assessment and data collection

3.9.1 Assessments At some place in the protocol, an overview of the critical stages in patient management and key points of assessment needs to be provided. At each of these assessments, whether at the first presentation of the consenting individual before randomisation, post-randomisation at visits when active treatment will be given, or for visits merely for check-up purposes, the precise details of what examinations should be made and the details to be recorded (on the trial data forms) must be indicated. In SQCP01 (2006) even those children with cleft palate randomised to delayed surgery will have the same assessment schedule as those randomised to immediate surgery as these time points represent important milestones in, for example, their speech development.

Example 3.18 Protocol SQCP01 (2006): Comparing speech and growth outcomes between two different techniques and two different timings of surgery in the management of clefts of the secondary palate Method Infants will be recruited at age less than 12 months and followed up until 17 years of age. They will be assessed at age 18 months, 3, 5, 7, 9, 15 and 17 years.

The PRESSURE (2000) trial lists the sequences of assessments under different headings, and for each provides details of precisely what is required. We have indicated these headings in Example 3.19 but only for their section 8.2.3 have we specified the details. The number of assessments depends on whether or not the patient has, or subsequently develops, pressure sores and also on how long they remain in the hospital ward. As one would expect, this contrasts markedly with SQCP01 (2006) which stipulates seven assessment times scheduled at various growth and speech development stages of the children.


Example 3.19

Protocol PRESSURE (2000): Pressure-relieving support surfaces: a randomised evaluation

8. ASSESSMENTS AND DATA COLLECTION
8.1 Registration and randomisation
8.2 Post randomisation assessments
8.2.1 Immediate
8.2.2 Daily
8.2.3 Twice weekly up to 30 days and then weekly up to 60 days

The research nurse or designated ward nurse will record the following details twice weekly up to 30 days and then weekly up to 60 days or trial completion/withdrawal:

• Skin assessment (sacrum, buttocks, heels and hips) using the skin classification scale.
• Mobility/activity/friction and shear/moisture/nutrition/sensory perception scores using Braden scale.
• Mattress checklist including: manufacturer, model, model number, type of mattress and confirmation that the mattress is alternating and working correctly. If the mattress has been changed by ward staff the reason for the change will be documented.
• Seating provision including model of chair or cushion
• Confirm continued eligibility

8.2.4 Weekly up to 60 days
8.2.5 Patients with pressure sores
8.2.6 At trial completion/and or discharge

3.9.2 Data collection In contrast to the previous section, here the precise details of the items required at each assessment should be specified. The items might not be listed exhaustively but are often indicated by reference to the trial forms with a set of specimen forms bound into the protocol. In PRESSURE (2000), skilled personnel were trained about detailed aspects of the protocol, including the examination and documentation procedures. Whenever possible, it is important to complete the documentation as the examination proceeds. Investigators should not rely on making routine clinical notes and sometime later completing the trial-specific forms, as the notes will not have been designed for trial-specific purposes and important items may be omitted. The protocol should make clear which form or forms are applicable for each assessment – so numbering the different trial forms in a logical manner is important for this. We return to data collection forms in Section 3.14 below and give some examples in Chapter 4.

3.10

Statistical considerations

The principal tasks of the statistical team before formulating this section are to debate with the clinical teams the issues relating to the final sample size chosen for the trial and to describe the main features of the subsequent analysis once the data are in hand. Straightforward statistical methods are preferable but not always feasible. The methods should be explained in lay terms and supported by accessible and understandable reference material.

3.10.1 Number of subjects
The number of subjects required will depend on the hypotheses to be tested, the trial design, the type of endpoint variables and whether or not characteristics of the patients themselves need to be taken into account. An important consideration when deciding the eventual trial size is whether such a recruitment target can be achieved in a reasonable time frame. In many instances, developing teams are overly optimistic in this regard. Although some details of the trial size process are somewhat routine, in that conventional wisdom dictates that the (two-sided) test size is set minimally at 5% and the power minimally at 90%, other details such as the anticipated effect sizes should be the subject of long discussion amongst the protocol development team. Nor should the team accept the conventional wisdom just indicated without debate. Test size is discussed in Chapters 8 and 9 and power in Chapter 9.

Example 3.20

Protocol SQGL02: Brimonidine as a neuroprotective agent in acute angle-closure glaucoma (AACG)

9. STATISTICAL CONSIDERATIONS
Trial size
For sample size determination, it is anticipated that approximately 80% of patients receiving Timolol will experience visual field loss progression at 3 months post laser PI. With the hope that the proportion of Brimonidine treated patients having visual field loss progression will be reduced to 40%, a two-sided test, with 5% level of significance and power of 90% would require recruitment of 30 patients in each arm (Machin, Campbell, Fayers & Pinol, 1997)¹. It is anticipated that after randomisation, 10% of the AACG patients may not respond to the initial medical treatment, and will require surgery instead of laser PI, and thus will not continue in the trial. After PI, it is expected that a further 10% of patients may not complete the trial and may withdraw from the study (see Withdrawal from Treatment). Taking into consideration the patients expected to fall out at each stage, we thus require approximately 80 AACG patients (40 patients per treatment group) for the trial.

¹ Now updated as Machin, Campbell, Tan and Tan (2018).
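
The trial size arithmetic in Example 3.20 can be checked with a short calculation. The sketch below is illustrative only: it assumes the standard Normal-approximation formula for comparing two proportions, which may differ in detail from the tables the protocol cites, and the function names `n_per_group` and `inflate_for_losses` are our own. The two successive 10% losses are applied multiplicatively, which is one possible reading of the protocol text.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.90):
    """Patients per arm for a two-sided comparison of two proportions.

    Assumption: Normal-approximation formula with a pooled variance
    under the null hypothesis; not necessarily the protocol's method.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


def inflate_for_losses(n, loss_rates):
    """Increase n so that n patients remain after successive losses."""
    retained = 1.0
    for rate in loss_rates:
        retained *= 1 - rate
    return ceil(n / retained)


evaluable = n_per_group(0.80, 0.40)                      # planning values from Example 3.20
recruited = inflate_for_losses(evaluable, [0.10, 0.10])  # two anticipated 10% losses
print(evaluable, recruited)  # 30 38
```

Under these assumptions the calculation gives 30 patients per arm before any allowance for losses and 38 after, which is consistent with the protocol rounding up to 40 per treatment group.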


In this example, the protocol takes note of patient losses due to a relatively large proportion expected to need surgery (rather than laser peripheral iridotomy, PI) and compensates by adjusting upwards the number of patients to recruit. It does not stipulate, however, how these patients will be dealt with in the final analysis and reporting. One option is to regard all these as failures when calculating the proportions in each group with visual field loss at 3 months post PI. An alternative design possibility would have been to randomise patients after successful PI, so as to avoid recruiting patients who will provide little information on the relative merits of Timolol and Brimonidine. However, the design team may have discussed this (as well as other options) and rejected it for good reasons.

3.10.2 Analysis
As with determining the trial size, the format of the analysis will depend on the type of questions being posed, the trial design, the type of endpoint variables and whether or not characteristics of the patients themselves need to be taken into account. If there is only a single endpoint concerned, then the plan for the subsequent analysis may be described rather succinctly except in circumstances where the analysis may be unusual in format. In some situations, several endpoints will be included so that a careful description of the analytic approach for each needs to be detailed. In these circumstances, multiple statistical tests may be involved and account may need to be taken of this.

Example 3.21

Protocol SQOLP01 (1999): Comparison of steroid with cyclosporine for the topical treatment of oral lichen planus

Analysis plan

Primary endpoints
Analysis of the primary endpoints of response and pain at 4 weeks will be made on an intention-to-treat basis.

Clinical response at 4 weeks
Comparison of the observed clinical response rates in the two treatment groups will be made using the χ²-test for the comparison of two proportions and a 95% confidence interval for the difference reported. In addition, logistic regression analysis will verify if this comparison requires adjustment for imbalance in the baseline clinical assessment values.

Pain score 4 weeks
Comparison of means in the two treatment groups will be made using the t-test with the appropriate degrees of freedom and a 95% confidence interval for the difference reported. (Should the VAS scores not follow an approximate Normal distribution shape then this analysis may be replaced by the Wilcoxon test). In addition, regression analysis will verify if this comparison requires adjustment for imbalance in the baseline VAS values and clinical assessment values.

Secondary analyses
In those patients for which the marker lesion is measurable by the grid, the mean of the total area remaining affected at 4 weeks for each treatment group will be compared using the t-test and 95% confidence interval for the difference reported. In addition, regression analysis will verify if this comparison requires adjustment for imbalance in the baseline target lesion area values as well as its location and other clinical assessment values. A more detailed longitudinal analysis will make use of all the individual measures (maximum 8 per patient) using the area under the curve (AUC) as the summary statistic for each patient. The mean AUC over all patients within the treatment group is calculated and then between treatment comparisons made using the t-test. A full statistical modelling approach using generalised estimating equations (GEE) is also anticipated. However, details of this latter methodology may have to await examination of the final data available. For example, this methodology would not be appropriate if all patients achieved complete response by week 4 and all subsequent target areas were then zero. In similar ways, the complete VAS for burning sensation profile will be summarised.

Adverse events
It is not anticipated that a formal statistical analysis comparing the adverse event rates between groups will be conducted. However, full details will be presented and their presence (if any) used to contextualise any observed treatment differences.

Additional analysis
Clinical response at week 8 will also be associated with initial (week 4) response to give an indication of the duration of response and to report any recurrences.
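
Two of the calculations named in the analysis plan of Example 3.21 can be sketched in a few lines. The code below is a minimal illustration and not the protocol's own software. The response counts and lesion areas are hypothetical values chosen only to show the arithmetic, and the function names are our own. The χ² statistic is computed without continuity correction and the confidence interval uses the simple Wald form; both are assumptions, since the protocol does not specify them.

```python
from math import sqrt
from statistics import NormalDist


def compare_proportions(r1, n1, r2, n2, level=0.95):
    """Chi-square test (1 df, no continuity correction) and Wald CI for p1 - p2."""
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_null
    chi2 = z * z                      # equals Pearson's chi-square for a 2 x 2 table
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return chi2, p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)


def auc_trapezoid(times, values):
    """Area under one patient's profile by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2
    return area


# Hypothetical data: 30/50 responders in one group versus 20/50 in the other.
chi2, p_value, ci = compare_proportions(30, 50, 20, 50)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}, "
      f"95% CI {ci[0]:.3f} to {ci[1]:.3f}")

# Hypothetical lesion areas for one patient at weeks 0, 1 and 2.
print(auc_trapezoid([0, 1, 2], [4.0, 2.0, 0.0]))
```

Each patient's AUC would then be treated as a single summary value, and the group means compared with the t-test as the plan describes.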

Example 3.21 details a number of analyses. Some of the repetition could be reduced by replacing ‘adjustment for imbalance in the baseline’ by ‘adjustment for baseline’. Although this is a very trivial example, if the number of words can be reduced without loss of clarity, then this reduces the eventual size of the document, facilitates the proof reading, and eases the job of the approving authorities.

3.10.3 Interim analysis
As part of the monitoring of the progress of trials, while recruitment is still ongoing, many protocols include interim looks at the trial data and these may be reviewed by an independent Data and Safety Monitoring Board (DSMB), more details of which are given in Chapter 10. One example of the remit given to a DSMB is in Example 3.22, which is taken from the EXPEL (2016) protocol as described by Kim, Chen, Tay, et al. (2017).


Example 3.22

Protocol EXPEL (2016): Peritoneal lavage after curative gastrectomy

An independent data monitoring committee (DMC) has been formed to monitor data and safety as this is a large clinical trial with multiple study sites. The DMC consists of subject matter experts who are not part of the EXPEL study team. An interim analysis will be performed after at least 200 subjects have completed 6-months follow-up. The purpose of the interim analysis is two-fold: firstly, to mitigate the risks from this trial should any unexpected serious adverse effects surface. Secondly, to allow for preliminary analysis of data to assess the efficacy of EIPL as compared with standard therapy. This would allow early closure of the trial should the efficacy be unequivocally proven, or modification of the trial with regard to increasing the sample size should it be deemed inadequate to show a more modest improvement in OS.

3.11

Ethical issues

There are both general and specific issues that should be addressed in this section. These may range from ensuring that the protocol conforms to the internationally agreed Declaration of Helsinki and the more recent Good Clinical Practice (GCP) regulations that have been adopted by many countries, through to obtaining local approval of the informed consent processes being applied.

3.11.1 General
There are internationally agreed standards under which clinical trials are conducted and these are encapsulated in GCP, which is referenced as ICH E6 (R2) (2016). Then there are the more specific requirements, perhaps national as well as local, that must be adhered to. The precise details of these will vary with the trial’s geographical location; whether it is single or multicentre in design; whether it is local, national or international; the type of interventions under study; and the intended target participation groups.

Example 3.23

Protocol PRESSURE (2000): Pressure-relieving support surfaces: a randomised evaluation

The study will be conducted in accordance with the Declaration of Helsinki in its latest form. The study will be submitted to and approved by a Regional Multicentre Research Ethics Committee (MREC) and the local Research Ethics Committee (LREC) of each participating centre prior to entering patients into the study. The NYCTRU will provide the LREC with a copy of the final protocol, patient information sheets and consent forms. The conduct of the trial will be monitored by a Trial Steering Committee consisting of an independent Chair and two independent advisors as well as the project team (Appendix H).


Note this statement acknowledges that the trial development team are aware of potential updates of the Declaration of Helsinki. This will also be the case for any of the regulatory and other codes of practice that may pertain, as these too are continually being revised to meet new circumstances. However, if regulatory changes are made, they may or may not be invoked for ongoing protocols that have current approval status. A major ethical requirement is to ensure that the potential participants in trials understand that participation is voluntary, and that they are free to withdraw their consent at any time and that doing so will not in any way compromise the future treatment that they will receive.

Example 3.24

Protocol SQGL02 (1999): Brimonidine as a neuroprotective agent in acute angle-closure glaucoma

The right of the patient to refuse to participate without giving reasons must be respected. After the patient has entered the trial the clinician must remain free to give alternative treatment to that specified in the protocol at any stage if he/she feels it to be in the patient’s best interest. However, the reasons for doing so should be recorded and the patient will undergo an early termination visit for the purpose of follow-up and data analysis. Similarly, the patient must remain free to withdraw at any time from protocol treatment without giving reasons and without prejudicing his/her further treatment.

3.11.2 Informed consent
Before any trial can take place, individual subjects have to be identified, and formal processes for their consent will have to be instituted. The precise details will depend on the type of trial contemplated, for example, whether it involves an invasive procedure, concerns primary intervention, or has therapeutic intent. Also, if it involves participants such as children, the very elderly, healthy volunteers, the terminally sick or women of fertile age then this may raise particular issues, for example, requiring proxy consent, or reassurance that the trial drugs will not compromise subsequent fertility. Importantly, the patient (or the consent giver) should understand that no-one knows in advance which therapy will be allocated, and that they should be willing to accept whatever the allocation may be. If they are unwilling, they should not be recruited to the trial.

The ideal is that each patient or volunteer gives fully informed and written consent. However, departures from this will be appropriate in specific circumstances. For example, such departures may concern patients who are unconscious at admission to hospital, patients with hand burns so severe that they cannot provide a signature, very young children or those who are mentally compromised. In these cases, a proxy may consent for them or, in the case of those with severe burns, witnessed verbal consent may be substituted.


Example 3.25

Protocol SQGL02 (1999): Brimonidine as a neuroprotective agent in acute angle-closure glaucoma

Before entering patients on the study, clinicians must ensure that the protocol has received clearance from their local Ethics Committees. The patient’s consent to participate in this trial should be obtained after full explanation has been given of the treatment options, including the conventional and generally accepted methods of treatment, and the manner of treatment allocation.

All the possible options on trial should be explained impartially to the patients concerned. This explanation must be provided before the randomisation is realised as knowledge of the assignment may influence the way in which an investigator explains the alternatives. A key feature of the informed consent process is to explain the randomisation procedure and to emphasise that participation is completely voluntary and that the patients can withdraw from the protocol at any time.

New challenges continue to arise, and the 2020 COVID-19 pandemic has had a profound and immediate impact on the conduct of clinical trials. Even without the lockdown and the ‘circuit breaker’ measures in place, the infectious nature of the coronavirus meant that the risks of continuing with research activities requiring face-to-face contact were deemed too high. As a consequence, in a proposed cluster randomised trial to be conducted in Singapore on the safety and efficacy of hydroxychloroquine in households with index COVID-19 cases, the study team proposed to contact those concerned via telephone to assess their interest in participating in the clinical trial. At the same time, the email address of the household contact would be obtained. Once the eligible participant had had ample time to read the electronic informed consent form (e-ICF), the study team would conduct the informed consent discussion via video consultation. If the subject then wished to take part in the trial, he/she would be required to digitally sign the e-ICF using DocuSign, in the presence of an impartial witness during the video consenting process.

Example 3.26

eConsent – acute stroke

Haussen, Doppelheuer, Schindler, et al. (2017) describe the use of an electronic consent (e-Consent) tool to facilitate the informed consent process in trials involving patients with acute ischemic stroke. Because language and cognitive status are impaired in these patients, the requirement for the physical signature of a legally authorised representative during the consent taking process hinders patient enrolment. The use of e-Consent was first piloted in a trial of patients with large vessel occlusion stroke. Eligible patients, who presented within 6–24 hours of last being seen as normal, were randomised to either stent-retriever thrombectomy or best medical therapy. Although the trial commenced recruitment in January 2015, the process of e-Consent was only approved by the institutional review board in December 2016, 2 months before enrolment to the trial ended in February 2017.

3.12

Organisational structure

The contents of this section will depend to a large extent on the size and complexity of the design of the trial. For example, protocol SQCP01 (2000) involves only two clinical centres, but these are in different countries with very different first languages and are geographically quite distant. Because this trial is organisationally complex, and involves many clinical disciplines, including specialists in surgery, orthodontics, and speech therapy, each with roles to enact over a very long period of 17 or more years, keeping track of the individuals concerned is a major challenge. However, each centre has long experience of dealing with such complex issues so the protocol runs with four named craniofacial/plastic surgeons, four orthodontic coordinators, three speech therapy coordinators and two research coordinators. At the statistical centre a medical statistician and clinical project coordinator are designated to the trial. The protocol contains full names, addresses, telephone, fax and email addresses of these individuals. As may be imagined, the actual individuals concerned will no doubt change as the trial progresses.

In contrast, SQNP01 (1997) was conducted within a single centre and three clinical coordinators were identified, one representing radiation oncology, the other two medical oncology. The conduct of this trial reflected the day-to-day management practices of the centre concerned. At the statistical centre, a medical statistician was designated and a nurse coordinator had shared responsibilities for the trial within both the statistical centre and the clinic from where the patients were recruited. The protocol contains full names, addresses, telephone, fax and email addresses of these individuals.

In situations where the demands of a trial are somewhere between these two extremes, the protocol development team must ensure that the necessary organisational structure is in place and that each component thereof knows its individual responsibilities. One important role of the trial office is to help maintain this functionality throughout the life of the trial. Wherever possible, the protocol should fit closely within the confines of current practice in the centres concerned, with the proviso that the aims of the trial are not compromised by so doing. This facilitates acceptance of what is new in the protocol by the local team and thereby should help with the smooth running of the trial and hopefully maximise recruitment rates.

3.13

Publication policy

In most clinical trial groups, there will inevitably be several members of the team and should the trial be multicentre then the full team of collaborators may be very numerous. It is useful to have stipulated a clear policy as to who the authors of the final publication will be and in what order they appear on the title page. Provided this is clear, and agreed by all concerned (including latecomers to the trial once it is ongoing) this need not be included in the protocol itself. If a large number of collaborators are involved then it may be more sensible to publish under a group name with a full list of the investigators included as an appendix to the report but again, this should have been discussed and agreed from the outset.

Example 3.27

Protocol SQOLP01: Comparison of steroid with cyclosporine for the topical treatment of oral lichen planus

PUBLICATION POLICY The results from the different centres will be analysed together and published as soon as possible. Individual clinicians must not publish data concerning their patients which are directly relevant to the questions posed by the trial until the main report is published. This report will be published under the name of the Asian Lichen Planus Collaborative Study Group listing all members of the group and any others contributing investigators.

In the event, the editor of the journal publishing the eventual report of Poon, Goh, Kim, et al. (2006) refused publication under a group name although a full list of contributors was permitted. Thus, collaborative groups may need to verify the policy of the target journals before stipulating a formal policy in this respect. Example 3.27 also underlines an important requirement that individual groups should ‘not publish data concerning their patients which are directly relevant to the questions posed by the trial until the main report is published’. An important reason for this is that such publication can only (at best) refer to a subgroup of the total number of patients recruited to the trial. This number is consequently less than that stipulated by the design and so any analysis will be underpowered for the hypotheses under test. So, for example, such an analysis may report ‘no statistical difference’ in situations where the whole trial data may conclude the opposite. Premature publication by an individual group may also jeopardise the acceptance for publication of the report of the final trial results.
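
The loss of power from analysing only a subgroup can be illustrated numerically. The sketch below assumes the Normal-approximation power formula for comparing two proportions, and borrows the planning values of Example 3.20 (80% versus 40%, two-sided 5% test, 30 patients per arm) purely for illustration. The single centre contributing 10 patients per arm is hypothetical, as is the function name `power_two_proportions`.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided test with n patients per arm.

    Assumption: Normal approximation with a pooled variance under the
    null hypothesis; exact methods would give slightly different values.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    sd_null = sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    z_beta = (abs(p1 - p2) * sqrt(n) - z_alpha * sd_null) / sd_alt
    return norm.cdf(z_beta)


full_trial = power_two_proportions(0.80, 0.40, n=30)     # whole trial as designed
single_centre = power_two_proportions(0.80, 0.40, n=10)  # hypothetical one-centre subset
print(f"full trial: {full_trial:.2f}, single centre: {single_centre:.2f}")
```

With these illustrative values the power falls from about 0.91 for the whole trial to about 0.44 for the single-centre subset, so a true difference of the planned size would more often than not be reported as ‘no statistical difference’.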

3.14

Trial forms

As we recommended earlier, since recording the patient data is integral to successful trial conduct, inclusion of the trial forms into the protocol itself is often desirable, even when they are quite simple in structure. However, if these are web-based rather than paper-based, how they are to be presented to any protocol review committee may depend on local circumstances. The forms should be developed in parallel with the protocol, and may (depending on local regulations) have to be submitted for approval with the trial protocol itself in any event. The number, structure and complexity of the forms required for a trial will be very trial-specific but, as a minimum, there will be forms containing subject-specific information relevant to the registration and randomisation process including the intervention assigned, those encapsulating eligibility and other baseline characteristics of those recruited, and a form for the endpoint assessment. In almost all circumstances there will be many more than this and most trials will include special forms for recording details of, for example, any surgical procedures undertaken or unexpected adverse events should they arise.

In general terms, the forms for the clinical trial should focus on essential detail that is necessary to answer the question(s) posed by the design and should not be cluttered with irrelevant items. This focus keeps the clinical teams aware of the key issues and takes them less time than having to record inessential details. Consequently, there are likely to be fewer errors. The completed form also becomes easier to check if there are fewer items, thereby reducing the data management processes and speeding up checking. In this way, any problems remaining can be fed back to the clinical teams more rapidly – which again reduces the workload at the clinical recruiting centre. The briefer the forms, and indeed the simpler the trial procedures, the easier it becomes for collaborators and the more rapidly they are likely to recruit the patients required. But this must be balanced against the need to ensure that the forms do contain all the necessary information that will be required for the analysis. However, the experience of many groups indicates that most trials collect far too much information that is then never analysed nor reported on.

Forms may need to include patient management details, such as the date of the next follow-up visit, or to confirm if an action has been taken such as the despatch of a laboratory specimen or the completion by the patient of a quality-of-life questionnaire. It is best if these are kept to a minimum, and located in a distinct part of the form (perhaps the last items) so that when the forms are received for processing at the trials office these items can easily be distinguished from variables that must be included in the trial database.

3.15

Appendices

A protocol will almost certainly have to contain Appendices. For example, in many cancer clinical trials toxicity is a major concern so that the criteria for reporting adverse events as recommended by the National Cancer Institute (2003) will often be reproduced. As informed consent is such a critical process, reviewing committees will almost certainly wish to see the proposed patient information sheets and the consent forms to be used. Figure 3.2 gives part of the patient information provided and Figure 3.3 the corresponding consent form for the COMPLIANCE (2015) trial protocol described by He, Tan, Wong, et al. (2018) concerning compliance with medication in women with breast cancer.


1. Study Information
Protocol Title: Improving medication adherence with adjuvant aromatase inhibitor in women with breast cancer: a randomised controlled trial to evaluate the effect of short message service (SMS) reminder

2. Purpose of the Research Study
You are invited to participate in a research study. It is important to us that you first take time to read through and understand the information provided in this sheet. Nevertheless, before you take part in this research study, the study will be explained to you and you will be given the chance to ask questions. After you are properly satisfied that you understand this study, and that you wish to take part in the study, you must sign this informed consent form. You will be given a copy of this consent form to take home with you.

You are invited because you have been receiving adjuvant (post-operative) endocrine therapy (or anti-hormonal therapy) for at least 2 years due to breast cancer, and are continuing to receive an anti-hormonal drug belonging to a class of drugs called aromatase inhibitor (AI), such as letrozole, anastrozole or exemestane, for at least one more year.

This study is carried out to find out if a computer generated short message service (SMS) reminder improves medication compliance amongst breast cancer women receiving oral AI therapy, and whether it can be used as a tool to remind patients to take medication as prescribed. The study will recruit 280 subjects from the National University Hospital, over a period of one year, from May 2015 to April 2016. The patients will be followed up for an additional year after recruitment.

Figure 3.2 Part of the Information Sheet utilised in the multicentre COMPLIANCE (2015) trial in patients with breast cancer. Source: COMPLIANCE (2015).

Figure 3.3 facilitates the consent approval process in cases when a prospective patient may not speak, in this case, English, is illiterate, or is otherwise compromised.

3.16

Regulatory requirements

3.16.1 Protocol amendments
Although great care should be taken in preparing the trial protocol, once the approved trial has opened for patient recruitment and is in progress, unforeseen circumstances may arise that impact on what is contained within the protocol. Such circumstances could range from the relatively trivial to the very serious. At one extreme, perhaps the packaging of a study drug is changed by the supplying pharmaceutical company without change to the potency or any significant aspects. At the opposite extreme, perhaps unanticipated and serious reactions in some patients occur, raising concerns about whether the trial medication is safe and consequently impacting on whether or not the trial should continue as originally planned. The consequences of the latter might, for example, result either in restricting the trial entry criteria by identifying those who are likely to be vulnerable and making them no longer eligible, or in reducing the dose should the anticipated reaction occur. Both of these represent an important change to the protocol. The protocol would then have to go through a reapproval process. In contrast, the minor change in packaging may only require informing the authorities of this fact. Of course, in this instance, if the protocol has to be changed for any other minor reason(s), then it would be prudent to make the change(s) at the same time.

Since protocol modifications are not infrequent, it is wise to keep the protocol as concise as possible; exclude all irrelevant detail; ensure main sections start on new pages; ensure page breaks do not break paragraphs (perhaps not important in the Background but may be critical if describing details of an intervention); number sections, tables and figures in such a way as to minimise the need for future renumbering or repagination of the protocol. Without such precautions, there can be severe consequences if any additions or modifications happen to occur in sections from the early pages of the protocol.

Protocol Title: Improving medication adherence with adjuvant aromatase inhibitor in women with breast cancer: a randomised controlled trial to evaluate the effect of short message service (SMS) reminder

I voluntarily consent to take part in this research study. I have fully discussed and understood the procedures of this study. This study has been explained to me in a language that I understand. I have been given enough time to ask questions that I have about the study, and all my questions have been answered to my satisfaction. By participating in this research study, I confirm that I have read, understood and consent to the National University Health System (NUHS) Personal Data Protection Notification. I also consent to the use of my Personal Data for the purposes of engaging in related research arising in the future.

Name of Participant          Signature          Date

Translator Information (to be completed for illiterate subject/parent/legally acceptable representative only)
The study has been explained to the participant/legally acceptable representative in ________ by ________

Witness Statement (to be completed for illiterate subject/parent/legally acceptable representative only)
I, the undersigned, certify to the best of my knowledge that the participant signing this informed consent form had the study fully explained in a language understood by her and clearly understands the nature, risks and benefits of her participation in the study.

Name of Witness          Signature          Date

Investigator Statement
I, the undersigned, certify that I explained the study to the participant and to the best of my knowledge that the participant signing this informed consent form clearly understands the nature, risks and benefits of her participation in the study.

Name of Investigator/Person administering consent          Signature          Date

Figure 3.3 Consent form designed to obtain assent from a patient to be randomised in the multicentre COMPLIANCE (2015) trial in patients with breast cancer. Source: COMPLIANCE (2015).

3.17

Guidelines

As we have indicated, GCP, set out in ICH E6 (R2) (2016), will dictate in full the items that are mandatory for such a protocol. Similarly, the SPIRIT 2013 statement provides recommendations for a minimum set of scientific, ethical and administrative elements that should be addressed in a clinical trial protocol. It is particularly important, and especially for clinical trials seeking formal registration of a new product, that investigators check local, national and even international requirements for what has to be included in the protocol itself. The definition of what is a ‘protocol’ given by Day (2007) and slightly amended in our Glossary includes the phrase ‘important details’, so it is imperative to check exactly what the current versions of the guidelines are suggesting, as they are continually changing. For example, ICH E6 (R2) (2016, Section 6) specifies for protocols sections on: Direct access to source data/documents; Quality control and quality assurance; Data handling and record keeping; Financing and insurance, which we do not include in Figure 3.1. In this document Section 8 also includes: Before the clinical phase of the trial commences; During the conduct of the trial; After completion or termination of the trial. Although these sections may be more appropriate to trials for products seeking regulatory approval, they contain many items pertinent in a wider context. Day (2007) in Appendix 1 of his dictionary lists 18 ICH ‘Efficacy’ Guidelines. The latest updates of these can be obtained from the ICH official website: http://www.ich.org.

ICH E6 (R2) (2016). Guideline for Good Clinical Practice. EMA/CHMP/ICH/135/1995.
ICH E9 (R1) (2018). Statistical Principles for Clinical Trials. CPMP/ICH/363/96.
SPIRIT 2013 Statement (2013). Defining Standard Protocol Items for Clinical Trials. http://www.spirit-statement.org (see Chan, Tetzlaff, Gøtzsche, et al., 2013).

3.18 Protocols

AHCC01 (1997). Randomised trial of tamoxifen versus placebo for the treatment of inoperable hepatocellular carcinoma. Clinical Trials and Epidemiology Research Unit, Singapore.

ENSG5 (1990). Comparison of high dose rapid schedule with conventional schedule chemotherapy for stage 4 neuroblastoma over the age of one year. UKCCSG, Leicester, UK.
COMPLIANCE (2015). Improving medication adherence with adjuvant aromatase inhibitor in women with breast cancer: study protocol of a randomised controlled trial to evaluate the effect of short message service (SMS) reminder.
EXPEL (2016). Extensive peritoneal lavage after curative gastrectomy for gastric cancer (EXPEL): an international multicentre randomised controlled trial.
PRESSURE (2000). Pressure relieving support surfaces: a randomised evaluation. University of York, York and Northern and Yorkshire Clinical Trials Research Unit, Leeds, UK.
SQCP01 (2006). A randomised controlled trial comparing speech and growth outcomes between 2 different techniques and 2 different timings of surgery in the management of clefts of the secondary palate. KK Women’s & Children Hospital, Singapore and Chang Gung Memorial Hospital, Taiwan.
SQGL02 (1999). Brimonidine as a neuroprotective agent in acute angle closure glaucoma. Clinical Trials and Epidemiology Research Unit, Singapore.
SQNP01 (1997). Standard radiotherapy versus concurrent chemo-radiotherapy followed by adjuvant chemotherapy for locally advanced (non-metastatic) nasopharyngeal cancer. National Medical Research Council, Singapore.
SQOLP01 (1999). A randomised controlled trial to compare steroid with cyclosporine for the topical treatment of oral lichen planus. Clinical Trials and Epidemiology Research Unit, Singapore.
UKW3 (1992). Trial of preoperative chemotherapy in biopsy proven Wilms’ tumour versus immediate nephrectomy. UKCCSG, Leicester, UK.

CHAPTER 4

Measurement and Data Capture

This chapter emphasises for all trials the essential requirement of taking appropriate measurements. The importance of clearly defining those measurements needed to determine the endpoint(s) and how, when, and by whom, they are to be taken is underlined. The particular value of the masked assessment is stressed as is the necessity to make the observations with sufficient precision, avoiding bias and recording the data in a suitable medium. Examples of different types of endpoint and problems associated with their determination are described. We stress that it is vital that all data collection forms of whatever type are clear, easy to complete and readily transferable to the trial database.

4.1 Introduction

In every clinical trial, information will have to be collected on the subjects included as they progress through the trial until the endpoint(s) is determined. Thus, an important aspect of trial design is the choice of measurements to be made and observations to be recorded. Once identified, details of how and when these measures are to be taken also have to be specified. Some of these measures may be very straightforward to determine, such as gender or date of birth, whilst others may require detailed and invasive clinical examination followed by laboratory assessments before they can be finally determined. Once available, the data collected need to be recorded in a logical and consistent manner. We will assume for didactic reasons that these will be on previously designed paper forms which will be entered onto a computerised and interactive trials database.

At an early stage, it is important to distinguish between several classes of data. There are those that are collected for purely descriptive purposes, to characterise the patients in the trial, and which are either weakly or not at all related to the outcome measures. There are those that are known to be prognostic for the outcome and possibly used when allocating subjects to the interventions for stratification purposes or may need to be taken account of when assessing the trial results. There are those which may be important for purely management purposes, for example, the date a pathology specimen was centrally reviewed (in this case the review panel outcome may be vitally important but not the date the review was undertaken). There may be data related to the safety of the patient, so the date of the unanticipated event plus details of the event itself will both be essential. Most important of all will be the intervention allocated at randomisation and the endpoint variable(s) which are to be used for the comparison of these interventions. Crucial to these are the date of randomisation and the date(s) at which the endpoint variables are determined.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.

It should be emphasised that the class for a specific variable, for example age of the participants, is not determined by the nature of the variable itself but rather by its use in the trial synthesis. Thus, age may be purely descriptive in the context of one clinical trial in which it has no prognostic influence whereas it will be prognostic in another where age is known to be an important determinant of outcome. For example, in children with neuroblastoma age is a very strong predictor for ultimate event-free survival time.

The data collected in a clinical trial setting should not contain extraneous information which is not essential for the progress of the trial or for its final synthesis. There is often a temptation to record ‘interesting’ information collected as an aside from the main thrust of the trial. The relevance of any such information must be weighed by the design team before the trial commences and, if not vital, it should not be recorded. Also, in the data checking and verification processes, the relative importance of variables should always be appreciated. For example, if the date of the pathology review examination is missing yet the outcome is recorded, sensible judgement has to be made as to whether the actual date needs to be pursued. Even if not felt important for trial analysis purposes, the missing data may be a requirement of Good Clinical Practice (GCP) and perhaps those responsible for the omission may have to be contacted to provide it.
As we have indicated, it is important also to distinguish variables which are to be used for descriptive and those for more analytical purposes. Comparisons between the randomised intervention groups of the descriptive variables are themselves also descriptive and hence no statistical tests are involved. In contrast, an endpoint variable is analytical and will be used for testing the trial’s main hypothesis and for calculating the corresponding confidence interval. Further, although we will give some details of analysis in Chapter 8 and some general expressions for trial size in Chapter 9, these will have to be selected and/or modified depending on the type of analytical outcome measure.

4.2 Types of measures

4.2.1 Qualitative data

4.2.1.1 Nominal or unordered categorical
Nominal data are data that one can name; thus they are not measured but simply counted. They often consist of binary or dichotomous ‘Yes or No’ type observations, for example, Dead or Alive; Male or Female; diagnosis following infection by the coronavirus (COVID-19). The corresponding summary statistic is the proportion (or percentage) of subjects within each category, often denoted by p, and the difference between two groups is p2 − p1. In this list, Dead or Alive may well be an endpoint variable while the others may be mainly descriptive of the type of patients to be recruited or characteristics of such patients. However, nominal data can have more than two categories, for example, country of origin, ethnicity or blood group: O, A, B or AB; all of which are unlikely endpoints. In clinical trials binary endpoints are frequent, as in success/failure and cured/not cured, but unordered categorical data are rarely an endpoint. However, this latter type quite frequently arises in other roles, for example when histology (cell type) is used as a prognostic indicator for survival in cancer trials, or marital status as an indicator for time to discharge from hospital in trials concerned with care of the elderly.

4.2.1.2 Ordered categorical or ranked
If there are more than two categories it may be possible to order them in some way. For example, after treatment patients may either improve, be the same or worse; the diagnosis of COVID-19 may be suspected, probable or definite, while for their ‘Global response’ endpoint Meggitt, Gray and Reynolds (2006, Table 3) categorise patients at the end of the treatment period as follows: Worse, No Change, Slight improvement, Moderate improvement, Striking improvement or Complete resolution of eczema. In this situation, it may be appropriate to assign ranks and to utilise these as numerical values.

4.2.2 Numerical or quantitative data
Numerical discrete data are counts such as 0, 1, 2, and so on, for example the number of episodes of migraine in a patient in a fixed period, say of 4 weeks, following the start of treatment. Depending on the distribution of the resulting trial data, the mean and standard deviation (SD) or the median and range might be used as summary statistics. In practice, a reduction in migraine episodes may be regarded as a partial success, but the percentage of patients with zero episodes might be considered as the most important single summary statistic. In contrast, numerical continuous data are measurements that can, in theory at least, take any value within a given range.
Examples are the descriptive variable maternal age (years) at the baby’s conception, and the endpoint variable the weight of the baby (g) at delivery. The corresponding summary statistics are often the mean, x̄, and the corresponding standard deviation.

4.2.3 Time-to-event
In many disease areas, survival is the most obvious and most important endpoint, and this has led to event-time analysis being loosely called survival analysis. In general, a time-to-event measure is the interval from randomisation until the patient experiences a particular event, for example, the healing of their burn wound in the trial of Example 2.2. The key follow-up information will be that which is necessary to determine healing. For example, burn healing might be defined as the final closing of all damaged body surface area. To establish this, the burn area may have to be monitored on a daily basis to determine exactly when this final closure is achieved. For those patients in whom healing occurs, the time from randomisation to healing can be determined in days by calculating the difference, t, between the date of complete healing and the date of randomisation. For those whose wound does not heal with medical treatment, but has to be excised or amputated, the time from randomisation to this operation can be assessed but not their healing time. Their data are therefore classified as censored at the time of operation. The time from the date of randomisation to this censoring date is termed T+. The eventual analysis of these ‘survival’ times involves either a t or a T+ for every patient.

4.2.4 Multiple sites per subject
In some situations, a trial may be designed in which a single intervention is given to a patient but multiple measures of the same outcome are possible. For example, if the trial is concerned with patients with glaucoma then the patient may have one or two eyes affected which may be treated by oral medication. Those with one affected eye would then (say) have the intraocular pressure (IOP) assessed for that eye at a certain point post-randomisation whereas if both eyes are affected two IOP assessments would be made. At the end of the trial, the two interventions will be compared for their effect on IOP which will come from (say) equal numbers of patients in each group but some of whom will contribute one reading and some two. The two-site situation can be extended to situations where there are multiple sites involved. Thus, in treating patients for pressure sores, there may be several potential pressure sites as well as sores to monitor. In such situations, the recruited subject is regarded as providing a cluster of k ≥ 2 outcome observations and special attention to this is required at the analysis stage.

4.2.5 Practice
The eventual design of the trial depends crucially (amongst other things) on the type of endpoint(s) chosen as this determines the type of statistical analysis that will be required and this then influences the number of subjects that need to be recruited to the planned trial.
The required trial size (discussed in Chapter 9) will usually be far greater for binary data compared to continuous, and so it is better for example to assess pain on a NRS-11 scale 0–10 rather than merely classify patients into pain requiring (or not requiring) analgesia. Similarly, analysis of survival data may require observations of many events (for example deaths) per group, so that trials involving hundreds of patients are often necessary. Thus, a vital component when considering design options is the type of measurement(s) to be undertaken.

4.3 Measures and endpoints
In general, a typical trial report will include information on a range of different variables including demographic, prognostic and endpoint. An illustration of some of these is given in Figure 4.1; these are extracted from a report of a randomised trial comparing two treatments for elderly patients with multiple myeloma conducted by Palumbo, Bringhen, Caravita, et al. (2006).

                                                        MP            MPT
Baseline
  Stage                     n                           126           129
                            IIA                         49 (39%)      50 (39%)
                            IIB                         3 (2%)        4 (3%)
                            IIIA                        62 (49%)      64 (50%)
                            IIIB                        12 (10%)      11 (9%)
  Creatinine (mg/L)         n                           125           129
                            Median                      8             8
                            Range                       6.0–68        5.6–102
Prognostic
  Age (years)               n                           126           129
                            Median                      72            72
                            80                          7 (6%)        6 (5%)
  β2-microglobulin (mg/L)   n                           110           116
                            Median                      3.7           3.7
                            Range                       0.20–37.5     0.36–40.0
                            ≤3.5                        53 (48%)      53 (46%)
                            >3.5                        57 (52%)      63 (54%)
Adverse events                                                                      p-value
  Haematological            Grade 3–4                   32 (25%)      29 (22%)      0.59
  Gastrointestinal          Grade 3–4                   1 (1%)        8 (6%)        0.036
Endpoints
  Response at 6 months      Complete or partial         60 (47.6%)    98 (76.0%)    Difference 28.3%, 95% CI 16.5 to 39.1
  Event free survival       Progression, relapse        62 (49.2%)    42 (32.5%)    HR 0.51, 95% CI 0.35 to 0.75, p-value = 0.0006
                            or death
                            2-year rate                 27%           54%

Figure 4.1 Selected baseline, adverse event and endpoint results from a trial in patients with multiple myeloma treated by melphalan and prednisone (MP) alone or with the addition of thalidomide (MPT). Source: Modified from Palumbo, Bringhen, Caravita, et al. (2006).

In this example, it was not immediately clear whether stage, creatinine clearance, age, or β2-microglobulin were merely descriptive or of prognostic importance. However, the Statistical analysis section of the published report indicates that age, as one would perhaps expect in an elderly group of patients, and β2-microglobulin are indeed regarded as prognostic factors. In this case, it is unfortunate that the latter measure is missing for 11% (29/255) of patients. When it is known that a particular variable is to be used for prognostic purposes, in that it will be taken account of in the intervention comparison, then it is vital that the investigators are made fully aware of its importance. The management team must do all it can to ensure the variable is fully recorded. In practice, this is not as easy as it may sound, particularly in the context of this trial which involved 54 different centres spread over a wide geographical area.

Figure 4.1 also includes the Grade 3–4 events of two of the ten types of adverse events reported. Each of these, although not the primary outcome variables, is compared statistically in the format shown. Endpoint variables included response, as defined by the European Group for Blood and Transplantation/International Bone Marrow Transplant Registry and described by Bladé, Samson, Reece, et al. (1998), and event-free survival (EFS) defined as the time from diagnosis until the date of progression, relapse, death from any cause, or the date the patient was last known to be in remission. Response rates were compared and the hazard ratio (HR) calculated, with both endpoint summaries suggesting an advantage to MPT.
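As a check on how the binary summaries in Figure 4.1 arise, the following minimal Python sketch recomputes the response proportions, their difference p2 − p1, and the fraction of patients with a missing β2-microglobulin value. The counts are those quoted in the figure; the function and variable names are our own illustrative choices, and no attempt is made to reproduce the published confidence interval, whose method of calculation is not stated here.

```python
def proportion(events: int, n: int) -> float:
    """Proportion of subjects in a category: the summary statistic p."""
    return events / n


# Response (complete or partial) at 6 months, counts taken from Figure 4.1
p_mp = proportion(60, 126)    # MP group
p_mpt = proportion(98, 129)   # MPT group
difference = p_mpt - p_mp     # p2 - p1

# Completeness of the prognostic variable beta2-microglobulin
n_randomised = 126 + 129      # patients in the two groups
n_recorded = 110 + 116        # patients with a recorded value
n_missing = n_randomised - n_recorded
fraction_missing = n_missing / n_randomised

print(f"MP response:  {100 * p_mp:.1f}%")
print(f"MPT response: {100 * p_mpt:.1f}%")
print(f"Difference:   {100 * difference:.1f}%")
print(f"Missing beta2-microglobulin: {n_missing}/{n_randomised} "
      f"({100 * fraction_missing:.0f}%)")
```

The difference agrees with the 28.3% shown in the figure, and the missing fraction with the 11% (29/255) quoted in the text.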
4.3.1 Assessments
In any trial, some of the assessments made may focus on aspects of the day-to-day care of the patient whilst others may focus more on those measures that are necessary to determine the trial endpoint(s) for each subject. It is important that these assessments are well defined and that endpoints are unambiguously defined so that they can be determined for each patient recruited. There will rarely be difficulty in determining or recording a date of death but for other measures, even everyday clinical ones, it may be necessary to define carefully how these are to be taken. For example, a physician may only need to know for diagnostic purposes if the temperature of the patient is elevated, say beyond 37.5 °C, whereas in a trial it may be important to record the temperature precisely as the trial may be investigating the change in these values following a specific intervention. In addition, it will be important to specify meticulously how (and when) the measure is to be taken, for example, the particular type of thermometer and whether by oral, thermal scan or rectal readings. If blood, urine or other samples are to be taken, once again ‘when’ will need to be specified but also, once obtained, the exact manner in which these are to be handled, stored and tested will need to be detailed. If specimens are analysed by a reference laboratory then their procedures too have to satisfy the trial requirements.

It is particularly important to assess carefully the implications of those measures which initiate a course of action if their value attains a certain level. For example, in a clinical trial of patients with burns, one may state that: patients are expected to be discharged from the hospital burns unit once their wound has healed to a sufficient degree.

However, how is ‘sufficient degree’ defined so that it can be unambiguously applied to each patient? In practice, it may also depend more on the support available ‘at home’ for the patient once discharged than on the intrinsic condition of the burn wounds themselves. In which case, care with this definition is required as it may lead inadvertently to the discharge of the patient and thereby prevent determination of the wound closure time needed for the purposes of the trial.

4.3.2 Endpoints
The protocol for every trial will detail the assessments that are to be made and it is essential to identify and carefully define which of the measures taken are themselves regarded as an endpoint or are one of several component parts required to indicate or establish an endpoint. Amongst these, the primary endpoints have to be identified as these are used in trial size determination and in the principal analyses when the intervention groups are eventually compared. The precise specification of such endpoints clearly depends on the clinical or other specialty in which the trial is conducted.

4.3.2.1 Objective criteria
In certain situations, there is not necessarily an obvious measure to take. For example, although one may regard tumour shrinkage as a desirable property of a cytotoxic drug when given to a patient, it is not immediately apparent how this is to be quantified. If every tumour were of regular spherical shape, the direction in which it is measured would be irrelevant. Furthermore, the diameter, a single dimension, would lead immediately to the volume of the tumour. However, no real tumour will comply with this ideal geometrical configuration and this has led to measures such as the product of the two largest (perpendicular) diameters to describe the tumour and then a reduction in this product to indicate response. However, the response measure, used as a secondary endpoint, in the context of the trial conducted by Palumbo, Bringhen, Caravita, et al. (2006) is detailed by Bladé, Samson, Reece, et al. (1998). This document offers the necessary guidelines to encourage uniform reporting of outcomes in the context of blood and marrow transplants. Investigators of trials may argue about the fine details, and no doubt in time these guidelines will need revision, but in the meantime, they would be foolish to ignore these recommendations when conducting and subsequently reporting their trial.

If there are justifiable reasons why other criteria should be used, or the recommendations cannot be followed for whatever reason, then before the trial commences these should be reviewed by the investigating team. There is little point in conducting a trial using measures not acceptable to other groups, including the referees for the clinical journals, as little note will then be taken of the results. The best option is to follow the guidelines for the primary endpoint as closely as possible, then use the other measures for secondary reporting and contrast the two in any discussion.

Example 4.1

Subjective skin assessment to identify bed sores

Brown, McElvenny, Nixon, et al. (2000) described some practical issues arising during the conduct of a trial of post-operative pressure sore prevention, and indicated that the inclusion of reactive hyperaemia in the skin assessment scale of Figure 4.2, which they were using as the endpoint measure, was subject to some debate. Indeed, the independent committee monitoring the trial became concerned that a subjective measure had been chosen to quantify the endpoint.

Grade   Description of skin
0       No discolouration
1       Redness to skin − blanching occurs (reactive hyperaemia)
2a      Redness to skin − non-blanching area (reactive hyperaemia)
2b      Superficial damage to epidermis broken or blistered
3       Ulceration progressed through the dermis
4       Ulceration extends into subcutaneous fat
5       Necrosis penetrates the deep fascia and extends to muscle

Figure 4.2 Skin assessment scale used for the grading of pressure sores. Source: Brown, McElvenny, Nixon, et al. (2000). © 2000, John Wiley & Sons

In some situations, less than optimal measures may have to be used. For example, although precise levels of pain experienced may be measured in the laboratory, such methodology may not be practical when levels need to be assessed at the bedside. A practical method of recording pain, or variables such as strength of feeling, is by means of a visual analogue score (VAS). A patient completes a VAS by making a mark on either a horizontal or vertical line to provide, once measured, an apparently continuous scale. It is useful for measuring aspects that may be difficult to put into words, as when used to assess pain, and in this context VAS behaves as if it is approximately linear (in the sense that a score of say 4 is twice as much pain as a score of 2). Also, because individuals tend to be internally consistent, VAS are good when measuring change within individuals, but they are not so good when comparing across individuals.

Example 4.2 VAS – pain assessment

In their clinical trial of patients with severe burns, Ang, Lee, Gan, et al. (2003) (see Example 2.2) used a VAS to assess the pain levels experienced by the patients. It is usual that the patients make such assessments themselves, marking their pain level experienced on a 10 cm VAS. However, when designing the trial, the authors anticipated that some patients would have burns which affect their ability to write easily, some would be too ill to complete the task, whilst others would have language and literacy issues. As a consequence, for this trial, the responsible nurse used, as necessary, the less refined verbal alternative administered in a local language or dialect familiar to the patient and assisted the patient to mark the scale when needed. It is clear from this example that the trial design team were aware of the difficulties involved and made adjustments to their methodology in the light of these.

4.3.2.2 Single measure – one data item
Single measure endpoints may include, for example, the birth weight of babies born in their own home. Others may be standard clinical measures such as systolic blood pressure (SBP) or the serum urea level obtained from a venous blood sample taken at a specific time post-randomisation. In the first example, exactly when this birth will occur will not be known whereas in the latter examples due notice of when the samples should be taken is established on the day of randomisation.

4.3.2.3 Single measure – several data items
In many trials, the endpoint of concern may be a critical event, such as the complete closure of a burn wound or the death of a patient with cancer. In each case, there will be a single item on the data recording form: Healed (Yes/No) or Death (Yes/No). However, if duration is required to calculate ‘time to healing’ or ‘time to death’ then this is calculated as the date of the event minus the date of randomisation. So, establishing this ‘time interval’ requires three data items: the event indicator, the date of the event and the date of randomisation. Similarly, in cardiovascular trials, the endpoint might be set as a composite such as a Major Adverse Cardiovascular Event (MACE) defined as the first (if any) of: non-fatal stroke, non-fatal myocardial infarction, or cardiovascular death. There is a further complication in this instance as illustrated by Example 4.3.

Example 4.3

Time to develop a first (or new) pressure sore

The endpoint of ‘Development of new pressure sores’ of Example 3.5 requires more detail than that given in Section 9.1.1 of the PRESSURE (2000) protocol. The observation period is set as beginning at the date of randomisation to a mattress type and ending after 60 days. In that period the first (or first new) pressure sore might arise on a given date, from which the ‘time-to-sore’ event in days can be determined. However, should a patient be discharged or transferred to another location before the 60 days with no new pressure sore, only the censored ‘time-to-these-moves’ can be determined. Further, if the patient has no new sore by the end of the 60-day trial period, their censored ‘time-to-sore’ will be 60 days. In such a case, alongside the associated date of the observation, the binary question: First sore (Yes/No) on the data form, would be extended to: First sore (Yes – sore present, Discharged/Moved without sore, No sore at 60 days).
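The derivation of the ‘time-to-sore’ variable described in this example might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, argument names and status labels are hypothetical rather than taken from the PRESSURE protocol, and we assume that a sore first recorded after a discharge or transfer, or after day 60, does not count as an observed event.

```python
from datetime import date
from typing import Optional, Tuple

OBSERVATION_DAYS = 60  # observation period stated in Example 4.3


def time_to_sore(randomised: date,
                 sore_date: Optional[date] = None,
                 moved_date: Optional[date] = None) -> Tuple[int, bool, str]:
    """Return (days, event_observed, status) for one patient.

    event_observed is False when the time is censored.
    Assumption: sores recorded after a move or after day 60 are ignored.
    """
    if sore_date is not None:
        days = (sore_date - randomised).days
        before_move = moved_date is None or sore_date <= moved_date
        if days <= OBSERVATION_DAYS and before_move:
            return days, True, "Yes - sore present"
    if moved_date is not None:
        days = (moved_date - randomised).days
        if days < OBSERVATION_DAYS:
            return days, False, "Discharged/Moved without sore"
    return OBSERVATION_DAYS, False, "No sore at 60 days"


# Hypothetical patients, all randomised on the same day
start = date(2000, 3, 1)
print(time_to_sore(start, sore_date=date(2000, 3, 15)))   # event observed
print(time_to_sore(start, moved_date=date(2000, 3, 21)))  # censored at move
print(time_to_sore(start))                                # censored at 60 days
```

Each patient thus contributes either an observed time or a censored time, together with the three-category status that the extended data form question would record.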

4.3.2.4 Multiple measures
In certain circumstances, there may be more than one possible location for the measure within a subject. For example, in determining whether or not a subject has glaucoma, the left, the right or both eyes will have to be examined. Similarly, there may be evidence of failure in the left, the right or both kidneys. An extreme example is whether or not each individual tooth is affected by caries. In many cases these may be reduced to a single primary measure such as the number of teeth with caries or an ordered categorical variable, indicating that 0, 1 or 2 eyes have evidence of glaucoma. On the other hand, it may be advantageous to keep these aspects as distinct. For example, if we were concerned with the resolution of eczema then in ‘moderate-to-severe’ cases there is likely to be more than a single (distinct) site concerned. Monitoring the progress of all sites may lead to a more efficient statistical design, but at the analysis stage, it is essential to make allowance for multiple sites being monitored within each patient. As these come from the same individual, it would be a mistake to regard these observations as independent.

4.3.2.5 Repeated measures
In a trial taking repeated temperature assessments, these may be recorded in order to determine a single outcome – ‘time for the temperature to drop below 37.5 °C’. In other situations, all the successive values of body temperature themselves may be utilised in making the formal comparisons. If the number of observations made on each subject is the same, and the intervals between successive observations are also the same for all subjects, then the analysis may be relatively straightforward. On the other hand, if the numbers of observations recorded vary, or if the intervals between successive observations differ from patient to patient, or if there are occasional missing data, then the summary and analysis of such data may be quite complex. Such complexities of data structure often arise in trials using a Health-Related Quality of Life (HRQoL) outcome in patients who, for example, are terminally ill.
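To illustrate how repeated assessments can be reduced to a single outcome such as ‘time for the temperature to drop below 37.5 °C’, the following Python sketch scans a patient's readings in time order. The data layout (pairs of hours since randomisation and temperature), the function name and the example readings are our own assumptions for illustration; a real protocol would also need to state whether the temperature must remain below the threshold.

```python
from typing import Optional, Sequence, Tuple

THRESHOLD_C = 37.5  # threshold quoted in the text


def time_to_below(readings: Sequence[Tuple[float, float]],
                  threshold: float = THRESHOLD_C) -> Optional[float]:
    """Return the time (hours) of the first reading below the threshold.

    readings holds (hours_since_randomisation, temperature_in_C) pairs.
    None is returned when no reading falls below the threshold, in which
    case the patient would be censored at the last assessment.
    """
    for hours, temperature in sorted(readings):
        if temperature < threshold:
            return hours
    return None


# Hypothetical readings for two patients
patient_a = [(0, 38.6), (6, 38.1), (12, 37.4), (18, 37.8)]
patient_b = [(0, 38.9), (6, 38.4), (12, 38.0)]

print(time_to_below(patient_a))  # first reading below 37.5 is at 12 hours
print(time_to_below(patient_b))  # never below the threshold
```

Because the outcome depends on when readings happen to be taken, unequal assessment intervals between patients translate directly into unequal precision of this derived time.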

Example 4.4 Multiple sclerosis

Comi, Pulizzi, Rovaris, et al. (2008) report, for their trial of the effect of laquinimod on MRI-monitored disease activity in patients with relapsing-remitting multiple sclerosis, the following complex measure:

The primary outcome was the cumulative number of gadolinium-enhancing (GdE) lesions at weeks 24, 28, 32 and 36.

Thus, the investigators would need to ensure that the four assessments are indeed made, and the analysis team would need to ensure that these are summed. Some guidance as to what might happen if an assessment is not made is required.
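A minimal Python sketch of how such a cumulative endpoint might be derived is given below. The rule adopted when an assessment is missing (the cumulative count is left undefined) is purely an assumption for illustration and is not the rule used by Comi, Pulizzi, Rovaris, et al. (2008); the function name and data layout are likewise hypothetical.

```python
from typing import Mapping, Optional

SCHEDULED_WEEKS = (24, 28, 32, 36)  # assessment weeks quoted in Example 4.4


def cumulative_lesions(counts: Mapping[int, Optional[int]]) -> Optional[int]:
    """Sum the GdE lesion counts over the four scheduled assessments.

    counts maps week number to the lesion count at that assessment.
    Assumed rule: if any scheduled assessment is absent or None, the
    cumulative count is returned as None so that the gap is visible
    and can be handled explicitly at the analysis stage.
    """
    total = 0
    for week in SCHEDULED_WEEKS:
        value = counts.get(week)
        if value is None:
            return None
        total += value
    return total


# Hypothetical patients
complete = {24: 2, 28: 0, 32: 1, 36: 3}
incomplete = {24: 2, 28: None, 32: 1, 36: 3}

print(cumulative_lesions(complete))    # all four assessments made
print(cumulative_lesions(incomplete))  # week 28 assessment missing
```

Returning an explicit missing value, rather than silently summing the available counts, forces the handling of incomplete assessments to be specified in advance.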

4.3.2.6 Multiple endpoints
If there are many endpoints defined, the multiplicity of comparisons made at the analysis stage may result in spurious statistical significance. This is a major concern if endpoints for HRQoL and health economic evaluations are additional to the already defined, more clinical, endpoints. As we have indicated, for design purposes it is essential to focus on the major (and few) key endpoints and it is these same endpoints that provide the focus at the analysis and interpretation stages once the trial is complete. Any secondary level endpoints should be identified as such at the planning stage and the manner in which they are to be summarised and reported indicated. Often less formal statistical comparisons will be made of these than for the principal endpoints.

4.3.2.7 Laboratory measures
In randomised trials involving laboratory endpoints, it is important that the same test kit is used for measuring the outcomes throughout the duration of the trial. In practice, however, the design of the test kit in use may be improved over time, so that it is replaced by a new kit offering a more sensitive assay. In such instances, the later measurement values may not be immediately comparable to those made earlier. One method to overcome this difficulty is to standardise the measurements within each of the periods concerned using a Z-score defined for an individual patient laboratory measure, xi, taken within that period by

$$ Z_i = \frac{x_i - \mu_{Lab}}{\sigma_{Lab}} \tag{4.1} $$

Here μLab and σLab are the mean and standard deviation of the Standard Assay of the reference sample as used by the laboratory concerned in the period within which a particular specimen is analysed. Once the Standard Assay is changed then the laboratory adopts a New Standard Assay described by μNew and σNew. From then on, the Z-scores of subsequent individual samples tested will be calculated using the form of Equation (4.1) but with the new laboratory values in place. Should there be more than one change in the Standard Assay utilised during a clinical trial, then the standardisation process would be repeated. The eventual analysis, following closure of the trial, would then use the mean Zi to summarise each intervention group appropriately and ultimately use these summaries to compare the groups concerned. In a similar way, anthropometric indicators from children are sometimes adjusted to a Z-score using the WHO (2006) child growth standards. Thus, in randomised trials, such as that of Daniels, Mallan, Nicholson, et al. (2013) concerned with interventions to influence early feeding practices and thereby prevent childhood obesity, Z-scores were calculated for each child concerned. The transformed data were then used to make comparisons between the interventions concerned.
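The standardisation of Equation (4.1) can be sketched in Python as below. The reference means and standard deviations for the two assay periods, the period labels and the patient values are all hypothetical numbers chosen only to show the mechanics; they are not taken from any trial.

```python
from typing import Dict, Tuple


def z_score(x: float, mu_lab: float, sigma_lab: float) -> float:
    """Equation (4.1): Z = (x - mu_Lab) / sigma_Lab."""
    if sigma_lab <= 0:
        raise ValueError("sigma_lab must be positive")
    return (x - mu_lab) / sigma_lab


# Hypothetical (mean, SD) of the reference sample for each assay period
REFERENCE: Dict[str, Tuple[float, float]] = {
    "standard": (70.0, 35.0),  # original Standard Assay
    "new": (20.0, 10.0),       # New Standard Assay after the kit changed
}


def standardise(x: float, period: str) -> float:
    """Standardise a measurement using the reference of its own period."""
    mu, sigma = REFERENCE[period]
    return z_score(x, mu, sigma)


# Two hypothetical patients measured with different kits
print(standardise(140.0, "standard"))  # (140 - 70) / 35
print(standardise(40.0, "new"))        # (40 - 20) / 10
```

Although the raw values 140 and 40 differ greatly, both lie two reference standard deviations above their period's reference mean, so their Z-scores coincide and can be summarised together within an intervention group.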

Example 4.5 Laboratory assay – estrone levels
In the COMPLIANCE (2015) trial of Figure 3.2 (Tan, Wong, Tan, et al., 2020), designed to test medication adherence with respect to use of the adjuvant aromatase inhibitor in women with breast cancer, a secondary objective was to evaluate whether the inhibition process would result in lower estrone levels at 1-year in those receiving a Short Message Service (SMS) reminder as compared to Control of no reminder. At the commencement of the trial, Liquid Chromatography-Tandem Mass Spectrometry was used for measuring estrone. However, it was later discovered that the lowest detection limit of estrone for this instrument was 10 pg/mL, and hence exact levels below this limit could not be recorded. Advice on more sensitive assays suggested the use of the ELISA kit from DRG International for subsequent tests between August 2015 and December 2016. However, changes to ELISA were subsequently made and the revised kit was used from January 2017. The estrone measurements of the new and earlier batches using ELISA differed considerably, with the mean and the standard deviation (SD) from the old test kit being substantially higher than those of the updated ELISA, as shown in Figure 4.3. The Z-scores at 1-year may be similarly calculated for the different batches and subsequently combined within each treatment to evaluate the effect of intervention on estrone levels by comparing the mean Z-scores using a t-test. An alternative method of accounting for the batch-to-batch variation may be to estimate, for each batch, the mean difference in estrone levels between treatments and the 95% CI, and to obtain a pooled estimate of the mean difference by taking account of the different batch sizes.

Example 4.5 (Continued)

                                    Estrone levels (pg/mL)        Z-score
Batch                               Control   SMS     Reference   Control   SMS
Batch 1 (Aug to Dec 2015)   n       15        16      64          15        16
                            Mean    138.2     159.0   72.94       1.92      2.54
                            SD      38.9      65.4    33.89       1.15      1.93
Batch 2 (Jan to Sep 2017)   n       17        19      14          17        19
                            Mean    43.0      37.2    24          2.41      1.67
                            SD      25.6      23.2    7.9         3.24      2.93

Figure 4.3 Batch baseline mean and SD of estrone levels, and corresponding Z-scores, as measured by ELISA by intervention group
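As a rough check on Figure 4.3, the summary statistics can be passed through Equation (4.1), and the alternative batch-by-batch approach can be sketched at the same time. The code below is a minimal illustration under stated assumptions: the interval uses an unequal-variance standard error with the normal multiplier 1.96, and the pooled estimate simply weights each batch by its size. Neither choice is given in the text, and other weightings (for example inverse-variance) are equally plausible.

```python
from math import sqrt

# Summary statistics transcribed from Figure 4.3 (estrone, pg/mL).
# Each batch: reference (mean, SD), then (n, mean, SD) for Control and SMS.
BATCHES = {
    "Batch 1": {"ref": (72.94, 33.89),
                "Control": (15, 138.2, 38.9), "SMS": (16, 159.0, 65.4)},
    "Batch 2": {"ref": (24.0, 7.9),
                "Control": (17, 43.0, 25.6), "SMS": (19, 37.2, 23.2)},
}


def z_summary(group_mean, group_sd, ref_mean, ref_sd):
    """Mean and SD of the Z-scores implied by a raw mean and SD (Equation 4.1)."""
    return (group_mean - ref_mean) / ref_sd, group_sd / ref_sd


def batch_difference(batch):
    """SMS minus Control mean difference with an approximate 95% CI.

    Assumption: unequal-variance standard error and the normal multiplier
    1.96; with groups this small a t-based interval would be wider.
    """
    n_c, m_c, s_c = batch["Control"]
    n_s, m_s, s_s = batch["SMS"]
    diff = m_s - m_c
    se = sqrt(s_c ** 2 / n_c + s_s ** 2 / n_s)
    return diff, (diff - 1.96 * se, diff + 1.96 * se), n_c + n_s


z_means = {}
diffs = {}
for name, batch in BATCHES.items():
    ref_mean, ref_sd = batch["ref"]
    for arm in ("Control", "SMS"):
        _, m, s = batch[arm]
        z_means[(name, arm)] = z_summary(m, s, ref_mean, ref_sd)[0]
    diffs[name] = batch_difference(batch)

# Pooled difference weighted by batch size (one simple weighting choice).
total_n = sum(n for _, _, n in diffs.values())
pooled_diff = sum(d * n for d, _, n in diffs.values()) / total_n

for key, z in z_means.items():
    print(key, round(z, 2))
for name, (d, ci, n) in diffs.items():
    print(name, round(d, 1), tuple(round(v, 1) for v in ci), n)
print("pooled SMS - Control difference (pg/mL):", round(pooled_diff, 2))
```

The mean Z-scores recomputed from the rounded summaries agree with those in the figure to within rounding error. The two batches give differences of opposite sign, which is a reminder that a pooled figure of this kind needs careful interpretation.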

4.3.2.8 Surrogates
There are times when the true endpoint of interest in a clinical trial is difficult to assess for whatever reason. In this case, a ‘surrogate’ may be sought (see the Glossary). For example, when investigating the possibilities of a novel marker for prognosis it may be tempting to use event-free survival (EFS) as a surrogate endpoint for the overall survival (OS) time of patients with the cancer concerned. An advantage is that for many cancers relapse occurs well before death and so the evaluation of the marker can occur earlier in time than would be the case if OS was to be the endpoint. A formal definition of a surrogate endpoint is: a biomarker (or other indicator) which is intended to substitute for a (often) clinical endpoint and predict its behaviour. If a surrogate is to be used, then there is a real need to ensure that it is an appropriate surrogate for the endpoint of concern. In most trials, the relevant endpoint will be an outcome that patients also perceive as being relevant, despite some investigators being keen to establish efficacy by using more sensitive biomarkers that can detect changes not reflected by clinically important benefits.

A distinction is made between an ‘intermediate’ and a ‘surrogate’ endpoint by Parmar, Barthel, Sydes, et al. (2008), who discuss design options for speeding up the evaluation of new agents in cancer. An intermediate outcome, for example disease-free survival (DFS), is required to be related to the primary outcome measure but does not have to be a true surrogate. It can also be argued that the time experiencing DFS is highly relevant to patients, as time spent in progression when the disease has returned is more likely to be associated with suffering.

4.3.3 Patient self-reported outcomes
Most randomised controlled trials are intended primarily to address simple endpoint questions of efficacy. However, sometimes other objectives such as HRQoL or other patient-reported outcomes (PROs) are particularly important, for example, in patients with chronic conditions who are terminally ill. The measurement process is now by means of one or more HRQoL instruments perhaps applied repeatedly over time. If a single domain of a single HRQoL instrument, measured at one time point, is to be used for intervention comparison purposes then no new principles are required, either for trial design purposes or analysis. On the other hand, and more usually, there may be several aspects of the instrument that need to be compared between interventions and these features will usually be assessed over time. This is further complicated by the often-unequal numbers of assessments available from each patient, caused by missing assessments that may, for example, arise for reasons related to a patient’s changing health status (termed informative missing) or may be missing at random. Although design principles may not change to a large degree, logistical problems are magnified. These include determining the schedule for when the HRQoL assessments are to be made and by whom – the patient or the carer (this may be instrument-specific), checking that all questions are completed and dealing with the large quantity of data at the analytical and reporting stages. These instruments are developed according to a very strict series of procedures and, in general, cannot be quickly developed just for the trial in hand. Fayers and Machin (2016) discuss these and other features of HRQoL data in some detail.

4.3.4 Economic evaluation
As with HRQoL, there may be circumstances where an economic evaluation of the relative merits of two treatments is required. This may be particularly important if non-inferiority is to be demonstrated or if the relative costs associated with particular modalities are difficult to quantify. If we were to design a trial primarily to compare costs associated with different treatments, we would follow the basic ideas of blinding and randomisation and then record subsequent costs incurred by the patient and the health provider. A very careful protocol would be necessary to define which costs are being considered so that these are measured consistently for all patients. However, most trials are aimed primarily at assessing efficacy. A limitation of investigating costs in the setting of a clinical trial is that the schedule and frequency of visits by the patient to the physician may be very different to what they would be in routine clinical practice. Typically, patients are monitored more frequently and more intensely in a trial than in routine clinical practice. The costs recorded in a clinical trial may therefore well be different (probably greater) than in clinical practice.

4.4

Making the observations

As we indicated in Chapter 1 and Figure 1.2, and stress again here, it is important that an optimal level of blinding or masking should be sought for all those involved with the trial. It is particularly important that any assessments of the endpoints concerned are as objective as possible. Any investigator deeply involved in a trial, of whatever type, will be aware of the hypothesis under investigation and knowledge of the intervention allocated to a patient may influence (however unintentionally) his or her assessment. Consequently, if at all possible, assessments should be made by persons, or by some means, with no knowledge of which intervention has been allocated. If full blinding cannot be achieved, then steps should be taken to blind as many aspects of the measurement process as possible. For example, if the recipient of the intervention can be blinded then whoever makes the assessment cannot be informed of the specific treatment by the subject. Further, a nurse who takes a blood sample is best blinded to the intervention, but even if not, it is desirable that the sample when processed by the laboratory should be assessed blindly there. Although once at the laboratory for analysis there may be no difficulty in ensuring the objectivity of the measurement process, if the sample is labelled in such a way as to indicate the values of the measures anticipated, then the measurement and recording process could be biased in some way by this knowledge.

For a laboratory sample, blinding may be easy to achieve while in other circumstances, such as taking the pain assessment in patients with two types of dressing for their burns, this may not be possible. In this case the dressing type utilised cannot be disguised from anyone. However, swabs taken from the wounds to assess the presence of methicillin-resistant Staphylococcus aureus (MRSA) can be sent to the laboratory for testing in a coded format to ensure objectivity at that level.

4.4.1 Which observer
In certain circumstances, it has to be made very clear who is an appropriate observer. If a HRQoL instrument is being used to assess patients then there are clear guidelines provided by Young, de Haes, Curran, et al. (2002) for the clinical trials of the European Organization for the Research and Treatment of Cancer (EORTC). These describe the manner and circumstances in which the instruments should be completed. The patient is the ‘observer’ and is supposed to complete the instrument, in this case, the EORTC QLQ-C30, him or herself. Only in specific circumstances can a proxy be used for this purpose and this must then be recorded in the trial documentation. Such a situation was highlighted in Example 4.2 describing VAS pain assessments in patients with severe burns.

4.4.2 Precision
A question often arises as to whether a continuous variable should or should not be recorded as a categorical variable for data recording and hence for future analysis. For example, is it better to classify the variable SBP into three separate categories say, hypo-tension

(SBP < 110 mmHg), normo-tension and hyper-tension (SBP > 160 mmHg) rather than bother with their individual blood pressures? The difficulty here is that despite the relative ease of coding, the categories are not so intuitive if recorded as 1, 2 or 3 (say) and this may increase the risk of a recording error. What is more, one is stuck with the definitions used at the onset and should they be required to change (perhaps others have used a different categorisation) then comparisons between trials are going to be difficult. It is usually best to record the variable as precisely as is reasonable. Most individuals know their date of birth and the experimenter knows the date of the enquiry so that age can be easily computed at a later stage. It could then be rounded to convenient categories by creating a new variable within the database while preserving the two dates indicated. However, when describing an endpoint variable, the direct use of the continuous variables themselves, rather than the same variable categorised, is statistically more efficient. The effect of grouping data is that the design will require more subjects than would be the case if the endpoint variable is utilised in its continuous form. If the underlying variable is continuous, then the precision with which this is to be measured has to be defined. This will depend on the ‘ruler’ available for the measuring process. Furthermore, it is common to find that observers show digit preference, such that one invariably finds that the last digits of a particular measure tend to be rounded to a ‘nice’ number such as 0. One good solution is to ask all observers to go to 1 decimal place further than the trial actually requires and leave the rounding process until the computational stage.
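The advice to store raw values and derive categories later can be illustrated with a short Python sketch. The function names, the example record and the treatment of the boundary values are assumptions made for illustration; only the cut-offs of 110 and 160 mmHg come from the text.

```python
from datetime import date


def age_in_years(date_of_birth, date_of_enquiry):
    """Completed years of age at the enquiry, derived from the two stored dates."""
    years = date_of_enquiry.year - date_of_birth.year
    # Subtract a year if the birthday has not yet occurred in the enquiry year.
    if (date_of_enquiry.month, date_of_enquiry.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def sbp_category(sbp_mmhg, low=110, high=160):
    """Derived category; the cut-offs are parameters so they can be revised later."""
    if sbp_mmhg < low:
        return "hypo-tension"
    if sbp_mmhg > high:
        return "hyper-tension"
    return "normo-tension"


def last_digit_counts(readings):
    """Tally the final digit of whole-number readings to look for digit preference."""
    counts = {d: 0 for d in range(10)}
    for r in readings:
        counts[abs(int(r)) % 10] += 1
    return counts


# Hypothetical record: the raw values are stored, the categories are derived.
record = {"dob": date(1950, 6, 15), "enquiry": date(2020, 6, 14), "sbp": 162}
record["age"] = age_in_years(record["dob"], record["enquiry"])
record["sbp_group"] = sbp_category(record["sbp"])
print(record["age"], record["sbp_group"])
print(last_digit_counts([120, 130, 140, 135]))
```

Because the dates and the measured SBP are preserved, a different categorisation can be applied later simply by changing the `low` and `high` arguments, and a tally of final digits gives a quick indication of whether observers are rounding to 0 or 5.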

4.5 Baseline measures
There are three kinds of baseline assessment that might be made and recorded on a pre-treatment, pre-randomisation form. The first comprises the baseline (descriptive) characteristics of the patients themselves together with relevant prognostic factors. In addition, there may be a single baseline assessment of one or more of the intended endpoint measures before the intervention has been allocated. An example is the baseline assessment of HRQoL using a patient-completed pre-randomisation HRQoL questionnaire. Finally, and most commonly used in the cross-over trials of Chapter 13, a run-in series of outcome measurements may be made up to and including the baseline (immediately prior to randomisation) measure.

As in the example of a pre-randomisation assessment of the SASSAD score in patients with eczema discussed in Example 1.7, the variable which is identified as the endpoint of concern may be assessed not just at a single time point but on several occasions. In some circumstances, there may be several pre- and several post-randomisation measures implicated. The first series of measures may represent a run-in period before initiating an active intervention, which is quite common in cross-over trials. This information may be used to assess the degree of within-subject variation or, more usually, to define a (pre-intervention) level which can be compared to that achieved post-intervention.

There are several ways in which a baseline measure can be used as an indicator of the effect of the intervention on the subject concerned. For a single observation measure post-intervention, one method is to calculate the percentage change. A second is to use the difference in scores for each patient as the unit for analysis. However, these methods are statistically sub-optimal and, in general, are not recommended. The more efficient approach is to use the baseline measure itself to extend the model of Equation (2.1) and implement a regression approach to analysis. Some of these issues are discussed further in Chapter 8.
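The regression approach mentioned above can be sketched for the simplest case of two groups and one baseline measure. The model form used here, outcome = b0 + b_treat × treated + b_base × baseline, is an assumption about how Equation (2.1) would be extended, since that equation is not reproduced in this section. The data are made up and the function name is hypothetical.

```python
from statistics import mean


def ancova_effect(baseline, outcome, treated):
    """Baseline-adjusted treatment effect for a two-group trial.

    Gives the least-squares estimate for the assumed model
        outcome = b0 + b_treat * treated + b_base * baseline
    using the closed form for two groups sharing a common baseline slope.
    """
    groups = {0: ([], []), 1: ([], [])}
    for b, y, t in zip(baseline, outcome, treated):
        groups[t][0].append(b)
        groups[t][1].append(y)

    sxy = sxx = 0.0
    means = {}
    for t, (bs, ys) in groups.items():
        mb, my = mean(bs), mean(ys)
        means[t] = (mb, my)
        sxy += sum((b - mb) * (y - my) for b, y in zip(bs, ys))
        sxx += sum((b - mb) ** 2 for b in bs)

    slope = sxy / sxx  # pooled within-group slope on baseline
    unadjusted = means[1][1] - means[0][1]
    adjusted = unadjusted - slope * (means[1][0] - means[0][0])
    return adjusted, slope, unadjusted


# Made-up data for illustration only: a true effect of 3 and a slope of 0.5,
# with the treated group happening to have the higher mean baseline.
baseline = [10.0, 12.0, 14.0, 16.0, 11.0, 13.0, 15.0, 19.0]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [2.0 + 0.5 * b + 3.0 * t for b, t in zip(baseline, treated)]

adjusted, slope, unadjusted = ancova_effect(baseline, outcome, treated)
print(adjusted, slope, unadjusted)
```

In this artificial example the unadjusted difference in means is inflated by the chance imbalance in baseline values, whereas the adjusted estimate recovers the effect built into the data. A full analysis would also require standard errors, which are omitted from this sketch.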

4.6

Data recording

4.6.1 Forms
The number and types of forms required will depend on the particular features of the trial but will usually consist of those to record (i) baseline information concerned with patient description and eligibility, (ii) details necessary for the randomisation to be effected and the actual allocation made, (iii) information while the intervention is being administered, (iv) details following the completion of the intervention and (v) endpoint information.

Forms are generally of two broad types – single and repeated. A single form may be that describing the ‘On trial’ characteristics of the participants while a repeated form may be one completed on several occasions during the active treatment period, perhaps four-weekly after each course of treatment. In this latter case, it is clear that a key component of this information is the date upon which the patient is examined in order to furnish the specific entries onto the form.

Forms are used to record factual information such as a subject’s age (better date-of-birth) and gender for descriptive purposes, pertinent details to confirm eligibility and, once randomised, the intervention group. They are commonly used in clinical trials to follow a patient’s progress and are often completed by the investigating team. For forms, the main requirement is that each form is clearly laid out and all investigators are familiar with it. However, even if all the data are to be collected by a single investigator it is still important that this is done in a clear and unambiguous way. Clarity of the experimental record with respect to the observations taken is becoming a routine feature of Good Clinical Practice (GCP) that must be adhered to in clinical trials for regulatory purposes. As variables and their names will need to be included in a database for further analysis, copies of data forms provide a good aide-memoire for a trial conducted some time ago.

4.6.1.1 Layout
A balance between a cramped and cluttered layout, and a well-spaced but bulky series of forms has to be made. Each form should contain clear instructions about how to respond to each question. Sometimes more than one response to a question is possible. It is important to make clear whether a single answer is expected, or whether multiple responses are acceptable. It may seem obvious, but questions and possible answers should be kept together; one should avoid having the question on one page and the response options on another.

Example 4.6

Form design – breast cancer

He, Tan, Wong, et al. (2018) used the (single) form of Figure 4.4 to register and randomise patients to their clinical trial of comparing a short message reminder (SMS) with standard care in reminding women with breast cancer to take their adjuvant aromatase inhibitor. The top sections of the form were completed by the clinical team before contacting the central randomisation office who, once details were confirmed, provided the trial number (unique for each patient) and the allocated intervention.

NUS National University of Singapore          NUH National University Hospital
Saw Swee Hock School of Public Health

COMPLIANCE TRIAL
REGISTRATION FORM

Short Message Service Reminder versus Standard Care In Breast Cancer

Please return this form immediately after randomisation to: Saw Swee Hock School of Public Health, National University of Singapore, Tahir Foundation Building, 12 Science Drive 2, #10-03F, Singapore 117549.

The checklist on patient’s eligibility is to be completed before randomisation (Yes / No):
1. Been receiving adjuvant endocrine therapy for at least one year, and continuing to receive adjuvant AI therapy for at least one more year.
2. Age 21 to 80 years.
3. Have cellular phone that can receive text messages.
4. Singaporean or permanent resident who is currently residing in Singapore.
5. Able to give informed consent.
6. Unable or not willing to comply with study procedures.

N.B. If any of the shaded boxes are ticked the patient is ineligible.

RANDOMISATION
Trial number:
Date of randomisation: day / month / year
Allocated Treatment: 1. SMS reminder   2. Standard care

Figure 4.4 Registration form. Source: Modified from He, Tan, Wong, et al. (2018).

4.6.1.2 Closed questions
Closed questions can be answered by completing the answer in a relevant box or, as for the patient eligibility questions in Figure 4.4, ticking confirmation. When constructing responses to closed questions it is important to provide a clear range of replies. For example, this form provides a single box for the (closed) response to the question concerning ‘Allocated Treatment’, with permitted responses of either ‘1’ or ‘2’ corresponding to ‘SMS reminder’ or ‘Standard care’. If this form were administered on a computer screen, entry procedures could be designed to prevent anything but a 1 or a 2 being entered in this space. This cannot be achieved using a paper-based system.

Forms may require a numerical answer so that in the (single) paper form of Figure 4.5, which is part of the trial documentation of Tang, Eu, Tai, et al. (2001), the responses are 1 (for No) or 2 (for Yes) to, for example, the question: ‘Abdominal bladder injury?’ although it is not totally clear how the boxes for ‘Trocar injury’ and ‘Instrumentation injury’ are to be completed. Alternatively, one might have to either tick boxes or circle the appropriate Yes/No response since asking the clinical teams to translate into numerical codes may be inviting errors. Clearly, a different choice of response options could be available for electronically completed forms.

On the data form of Figure 4.5, the closed question for ‘Volume of blood lost’ provides the unit of measure required, here ml, and an appropriate number of boxes is given for the numerical value to be recorded. If the variable to be recorded requires two decimal places, then a style of boxes on the form such as: □□□•□□ is a convenient reminder of this. Dates will also frequently need to be recorded and, because of the different conventions used and the transition over the millennium, it is important to indicate clearly the requisite (boxes) for day, month and year by, for example, dd.mm.yyyy.
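The kind of entry checks that an electronic form could apply to these closed questions might look like the following Python sketch. The field names and the rule that up to three digits may precede the decimal point are assumptions for illustration; they are not taken from the trial forms themselves.

```python
import re
from datetime import datetime


def check_allocation(value):
    """Accept only the codes 1 (SMS reminder) or 2 (Standard care)."""
    return value in ("1", "2")


def check_date(value):
    """Accept only real calendar dates written as dd.mm.yyyy."""
    if not re.fullmatch(r"\d{2}\.\d{2}\.\d{4}", value):
        return False
    try:
        datetime.strptime(value, "%d.%m.%Y")
    except ValueError:
        return False
    return True


def check_two_decimals(value):
    """Accept up to three digits, a point, then exactly two decimals."""
    return re.fullmatch(r"\d{1,3}\.\d{2}", value) is not None


# Hypothetical field names mapped to the check each one must pass.
CHECKS = {
    "allocated_treatment": check_allocation,
    "date_of_randomisation": check_date,
    "measurement": check_two_decimals,
}


def validate(entry):
    """Return the names of the fields that fail their check."""
    return [name for name, check in CHECKS.items()
            if not check(entry.get(name, ""))]


entry = {
    "allocated_treatment": "3",
    "date_of_randomisation": "31.02.2020",
    "measurement": "12.50",
}
print(validate(entry))
```

Rejecting an impossible code or date at the point of entry is the electronic counterpart of the clearly labelled boxes on a paper form, which can only guide the person completing it.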
In the example shown, the layout with respect to day-month-year is very clear although not exactly of this format. However, the text would have been better set closer to the boxes concerned.

4.6.1.3 Open questions
In an open question, respondents are asked to describe in their own words. For example, in Figure 4.5 there is also an open question: ‘Other complications (please specify)’. In the context of this clinical trial, any responses would be expected to be of one or two words only from the investigating team, whereas in other circumstances a full description may be expected. In general, in clinical trials, open questions are best avoided or at least kept to a minimum.

NMRC National Medical Research Council

SURGERY AND INTRAOPERATIVE COMPLICATIONS FORM

Please return this form, following surgery, to: NMRC Clinical Trials Office, #02-14 Bowyer Block, Singapore General Hospital, Singapore 169608

A patient sticker with the following information may be used for this section if available:
Patient’s name          Trial Number          Hospital          NRIC

To be completed by the surgeon as soon after surgery as possible

Has a liver ultrasound been carried out less than 30 days before planned surgery?   1. No  2. Yes
If yes, date of liver ultrasound: day / month / year
Liver metastases present?   1. No  2. Yes

Did the patient experience any of the following complications during operation?
Gross faecal contamination                                  1. No  2. Yes
Ureteric injury                                             1. No  2. Yes
Duodenal injury                                             1. No  2. Yes
Cardiac insufficiency/dysrhythmia                           1. No  2. Yes
Pulmonary insufficiency (from pneumoperitoneum)             1. No  2. Yes
Significant intra-operative haemorrhage
(ie haemorrhage that requires peri-operative transfusion)   1. No  2. Yes
Volume of blood lost:                                       ml
Surgical emphysema                                          1. No  2. Yes
Small bowel injury                                          1. No  2. Yes
Major vessel injury                                         1. No  2. Yes
Abdominal bladder injury                                    1. No  2. Yes
Specify: Trocar injury / Instrumentation injury
Other complications (please specify)                        1. No  2. Yes

PLEASE TURN OVER

Figure 4.5 Surgical complications form. Source: Based on Tang, Eu, Tai, et al. (2001).

4.6.1.4 Follow-up or repeat forms
Any form that has to be completed a repeated number of times for every subject within the trial by the clinical team, perhaps over an extended period and at infrequent intervals, needs special attention given to its design. Not only must the form clearly demark the variables it wishes to capture, it must also make very clear the precise scheduling of the patient examinations that are necessary. Thus the follow-up form of Figure 4.6 used by Wee, Tan, Tai, et al. (2005), in their trial which recruited patients with nasopharyngeal cancer, specifies in the header the following instructions:

SQNP01: FOLLOW FORM

Radiotherapy versus Concurrent Chemoradiotherapy Followed by Adjuvant Chemotherapy for Locally Advanced Non-Metastatic Nasopharyngeal Cancer

Patients are required to be followed-up 4 monthly for the first year, 6 monthly for 2 years, and annually thereafter, the anniversary date of follow-up being taken as the last day of RT.

Please return this form immediately after each follow-up to: NMRC Clinical Trials Office, #02-14 Bowyer Block, Singapore General Hospital, Singapore 169608. Fax: 324 0417

Patient Name          Trial Number          NRIC          Hospital

Section A
Date of follow-up: day / month / year
Clinical Status (please circle)
1. Primary    NED / Static / Equivocal / Prog*
2. Neck       NED / Static / Equivocal / Prog*
3. Distant    NED / Static / Equivocal / Prog*
*If progression please fill in a relapse form

Section B (To be filled at *4 months and annually after the last date of RT)
*CT Scan PNS    day / month / year    Findings
Chest X-ray     day / month / year    Findings

Section C
If the patient has died:
Date of death: day / month / year
Cause (please tick): Disease related / Treatment related / Other    Specify:

Signed                    Date

Figure 4.6 Follow-up form. Source: Based on Wee, Tan, Tai, et al. (2005).

Patients are required to be followed-up 4 monthly for the first year, 6 monthly for 2 years, and annually thereafter, the anniversary date of follow-up being the last date of RT.
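One possible reading of this follow-up schedule can be turned into a list of due dates, as in the Python sketch below. The interpretation (visits at 4, 8 and 12 months, then every 6 months up to 36 months, then yearly) is an assumption, as are the function names and the rule of clamping to the last day of a shorter month. The form itself does not spell out these details.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Shift a date by whole months, clamping to the last day of the month."""
    index = start.year * 12 + (start.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def follow_up_months(horizon_months):
    """Months after the last day of RT at which a follow-up visit falls due."""
    months = [4, 8, 12]             # 4-monthly for the first year
    months += [18, 24, 30, 36]      # 6-monthly for the next 2 years
    months += list(range(48, horizon_months + 1, 12))  # annually thereafter
    return [m for m in months if m <= horizon_months]


def follow_up_dates(last_rt_day, horizon_months=60):
    """Due dates for every scheduled visit up to the chosen horizon."""
    return [add_months(last_rt_day, m) for m in follow_up_months(horizon_months)]


# Hypothetical patient whose radiotherapy ended on 31 October 2004.
for due in follow_up_dates(date(2004, 10, 31)):
    print(due.isoformat())
```

Generating the dates from the last day of RT in this way gives the clinical team an explicit visit list for each patient, rather than leaving each centre to interpret the header instruction.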

Although this schedule is clear there is nevertheless some ambiguity in this form over what is intended with respect to the schedule for the two components of Section B.

4.6.2 Questionnaires
In certain situations, a form may be in the format of a questionnaire to be completed by a subject recruited to the trial. The distinction between a form and a questionnaire is that a questionnaire is an ‘instrument’ for measuring something, perhaps the HRQoL status of a patient, whereas a form is merely for recording information. In practice, a questionnaire may contain ‘form-like’ questions such as asking for gender and date of birth as well as the instrument variables themselves. A questionnaire may try to measure personal attributes such as attitudes, emotional states and levels of pain or HRQoL, and is often completed by the individual concerned. A convenient distinction between forms and questionnaires (although by no means always the case) is that the investigating team completes forms while the trial participants themselves complete questionnaires. Thus, forms can be short and snappy, and any ambiguities explained amongst the investigators. In contrast, questionnaires need to be very carefully designed, particularly with respect to the choice of language they use to pose the questions. For example, technical jargon, like that scattered throughout the form of Figure 4.5, is suitable for a specialist surgical team but should be avoided in a questionnaire.

4.6.2.1 Layout
As with forms, the questionnaire should have clear instructions and an attractive layout. It helps to reduce bulk by copying on both sides of a page and reducing the size of text to fit a booklet format. However, if those completing the questionnaire are older, with poorer eyesight, or if the place where they sit to complete it may have inadequate lighting, it may be necessary to increase the size of the text. It is generally held that shorter questionnaires achieve better response rates than longer ones. However, it is hard to define what is ‘too long’, so piloting of what is intended before the trial commences may be advisable to determine how acceptable the burden of the questionnaire(s) is likely to be for the trial patients.

Example 4.7 Question layout – sexual function after gynaecological cancer Jensen, Klee, Thranov and Groenvold (2004) developed a questionnaire to evaluate sexual function in women following treatment for a gynaecological malignancy as a preliminary to a longitudinal trial. Part of their final questionnaire, developed following the process described by Sprangers, Cull and Groenvold (1998), is reproduced in Figure 4.7.

Example 4.7 (Continued)

Physical contact and sexual relations can be an important part of many people’s lives. People who suffer from illnesses involving their pelvic region may experience changes in their sex life. The questions below refer to this. The information you provide will remain strictly confidential. Please answer all the questions yourself by circling the number that best applies to you.

Part 1
During the past month:                                       Not at all   A little   Quite a bit   Very much
1. Have you been interested in close physical contact
   (a kiss and a cuddle)?                                    1            2          3             4
2. Have you had close physical contact with your family
   and close friends?                                        1            2          3             4
3. Have you had any interest in sexual relations?            1            2          3             4

                                                             Yes          No
4. Do you have a partner?
   (If not, please continue to question 8)                   1            2

Figure 4.7 Part of the Questionnaire (SVQ) for self-assessment of sexual function and vaginal changes after gynaecological cancer. Source: Jensen, Klee, Thranov and Groenvold (2004). © 2004, John Wiley & Sons

For questionnaires, particularly the ones trying to evaluate something like HRQoL, the pragmatic advice is, if possible, do not design your own, use someone else’s! There are a number of reasons for this apparently negative advice. First, use of a standardised format means that results should be comparable between trials. Secondly, it is a difficult and time-consuming process to obtain a satisfactory questionnaire. Thirdly, standard instruments will usually have been developed by a team that includes researchers from a wide range of disciplines, and the instrument should have undergone a lengthy validation process.

4.6.2.2 Closed questions
Although a questionnaire may include some form-like closed questions such as asking for the gender of the participant, there may be others eliciting less directly measurable information.

Example 4.8 Closed question – sexual function after gynaecological cancer
Jensen, Klee, Thranov and Groenvold (2004) ask on the SVQ: ‘Have you had close physical contact with your family and close friends?’ For this closed question the response options are: (1) Not at all, (2) A little, (3) Quite a bit, (4) Very much. However, a trial participant may object to being forced into a particular category, and simply not answer the question as a result. Patients may be very elderly, and it is not clear how someone should respond if all their family and close friends are deceased.

One type of closed question, termed a Likert scale, is one that makes a statement and then asks how much the respondent agrees or disagrees.

Example 4.9 Likert scale – SF-36
Ware, Snow, Kosinski and Gandek (1993) use Likert type questions in the general health section of their SF-36 quality of life instrument as shown in Figure 4.8.

11. How TRUE or FALSE is each of the following statements for you?

                                                          Definitely   Mostly   Don’t   Mostly   Definitely
                                                          true         true     know    false    false
a  I seem to get sick a little easier than other people   1            2        3       4        5
b  I am as healthy as anybody I know                      1            2        3       4        5
c  I expect my health to get worse                        1            2        3       4        5
d  My health is excellent                                 1            2        3       4        5

Figure 4.8 General health question from SF-36v2. Source: Based on Maruish, Maruish, Kosinski, et al. (2011).

This format has the advantage of being compact, and there is little chance of people filling in the wrong bubble. However, some questionnaires avoid central categories such as the ‘don’t know’ of the SF-36.

4.6.2.3 Open questions
Just as with forms, open questions pose difficulties. Thus, although they allow free rein to the participant to explain their response as they wish, this brings problems of data summary for the investigating team if the number of participants is more than a few.

4.6.2.4 Response bias
One problem associated with asking questions is to know the extent to which the answers provided by respondents are valid. In other words, do their responses truly reflect their experiences or attitudes? If they do not, what are the causes of bias? Response bias can arise in any one of a variety of ways. One such bias is ‘social desirability bias’ and this arises particularly in respect of questions on sensitive topics. It occurs when respondents conceal their true behaviour or attitudes and instead give an answer that shows them in a good light, or is perceived to be socially acceptable. Respondents’ answers may also differ according to who is asking the questions. ‘Recall bias’ affects questions involving recall of past events or behaviour, and may result in omission of information or misplacing an event in time. Biases can also arise from the respondent mishearing or misunderstanding the question or accompanying instructions. Questionnaire wording and design can also induce response bias, for example through question sequencing effects (where the response given to a particular question differs according to the placement of that question relative to others) and the labelling and ordering of response categories.

4.7 Technical notes

Just as we had Equation (2.1) to indicate the underlying model structure for design purposes, we can describe aspects of the measurement process in a similar way. Thus, we can express the measurement we are making in the following way,

$$x = X + \eta \tag{4.2}$$

Here X is the true value of the reading that we are about to take on a trial participant. After we have made the measurement, we record X’s value as x. We know, with most measures we take, that we will not record the true value but one that we hope is close enough to this for our purpose. We also hope, over the series of measurements we take (one from each participant), that the residual (or our error) η = x − X of Equation (4.2) will average out to be small. In which case, any errors we make will have little impact on the final conclusions. However, if there is something systematically wrong with what we are doing (possibly something of which we are quite unaware), then the model we are concerned with becomes

$$x = B + X + \eta \tag{4.3}$$

This second model implies that even if we average out η to be close to zero over the course of the trial, we are left with a consistent difference, B, between the true value X and that we actually record, x. This is termed the bias. Thus, when taking measurements, we should try from the outset to ensure that B = 0.
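The distinction between Equations (4.2) and (4.3) can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the true value of 120, the bias B = 5, the Normal error with standard deviation 1 and the function name `simulate_readings` are all hypothetical choices made for illustration.

```python
import random
import statistics

def simulate_readings(true_values, bias=0.0, error_sd=1.0, seed=1):
    """Return recorded values x = B + X + eta for each true value X."""
    rng = random.Random(seed)
    return [bias + X + rng.gauss(0.0, error_sd) for X in true_values]

true_values = [120.0] * 2000          # hypothetical true readings X
unbiased = simulate_readings(true_values, bias=0.0)   # Equation (4.2)
biased = simulate_readings(true_values, bias=5.0)     # Equation (4.3)

# Average of x - X: close to 0 under (4.2), close to B under (4.3)
mean_err_unbiased = statistics.mean(x - X for x, X in zip(unbiased, true_values))
mean_err_biased = statistics.mean(x - X for x, X in zip(biased, true_values))
print(round(mean_err_unbiased, 2), round(mean_err_biased, 2))
```

However many readings are averaged, the random error η shrinks towards zero but the systematic component B does not.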

4.8 Guidelines

ICH E6 (R2) (2016). Guideline for Good Clinical Practice. EMA/CHMP/ICH/135/1995. Of particular relevance to this chapter is Section 5.5 Trial management, data handling, and record keeping.

ICH E9 (R1) (2018). Statistical Principles for Clinical Trials. CPMP/ICH/363/96. In particular this comments on the use of surrogate variables.

Young T, de Haes H, Curran D, Fayers PM, Brandberg Y, Vanvoorden V and Bottomley A (2002). Guidelines for Assessing Quality of Life in EORTC Clinical Trials. European Organisation for Research and Treatment of Cancer (EORTC), Brussels. This is a very practical guide, the value of which is not confined to those conducting clinical trials in cancer.

CHAPTER 5

Randomisation

The method of choosing which intervention is assigned to a particular subject is an essential feature for maximising the useful information from a clinical trial. We give the rationale for why a random element to the choice is desirable and describe how random numbers may be used to assist the implementation of this. We describe how interventions may be grouped into randomised blocks such that the balance between interventions set by the design is maintained. Situations where there may be gain in recruiting a larger number of patients to one group than the other are introduced.

5.1 Introduction

As we indicated in Chapter 2, in any clinical trial where we intervene in the natural course of events, a decision has to be taken as to which intervention is given to which individual. In general, whatever the basic design, one should choose the structure of the design to answer the question posed and then make the choice of intervention as ‘random’ as possible.

A key reason to randomise the alternative interventions to patients is to ensure that the particular one chosen for the individual patient is not predictable. Thus, the trial protocol will describe in careful detail the type of subjects eligible for the trial and, for example, the options under test. Only subjects for whom all stated options are appropriate should be entered into the trial. Thus, for whatever reason, if one option appears to be less favourable for the subject who is being assessed for recruitment then he or she should not be entered into the trial. Consequently, if the intervention intended is known in advance (or at least thought to be predictable) during the eligibility decision verification process, this knowledge may (perhaps subconsciously) influence the clinical team’s decision to include (or exclude) that individual from the trial. Thus, they might not judge fairly whether or not each of the options is appropriate for the particular subject under consideration. Any prior knowledge or accurate predictions of the intervention to be given may also compromise the informed consent procedure as the knowledge may lead the assessment team to a more selective description of the options available within the trial setting, thereby focussing more on the intervention that they anticipate will be given and less on the alternative(s). Thus, it is important that the investigating team are not aware of the treatment to be allocated to the next patient.

Also, if the particular treatment intended is known by the patient, for example if it may be the ‘new’ therapy, then a patient may more readily volunteer because that is the treatment that he or she will receive. On the other hand, if the patient knows they are to get the standard or traditional approach, they may be more reluctant to consent to enter the trial. Thus, any prior knowledge of the intended allocation by either the clinical team or the patient can introduce selection bias into the allocation process, and this will lead to unreliable conclusions with regard to relative efficacy of the interventions under test.

In a randomised-controlled trial, patients are usually recruited one at a time and over a prolonged period, so that allocation to the intervention has to be made sequentially in time and then usually patient by patient. Thus, the randomisation process will continue until the last patient is recruited. There will be very few occasions in which all the patients are recruited to a clinical trial on the same day. As we have indicated in Section 2.7, it is best if the randomisation is made as soon as possible after consent has been given and that the intervention commences immediately thereafter.

5.2 Rationale

In essence, the purpose of any experiment is to estimate the parameters of a model analogous to that of Equation (2.1). Thus, we collect data with this purpose in mind. We would like to believe that the estimates we obtain in some way reflect the ‘true’ or population parameter values. In principle, if we repeat the trial many times, then we would anticipate that these estimates would form a distribution that is centred on the true parameter value. If this is the case, our method of estimation is without bias. For example, in a clinical trial comparing two treatments, the parameter βTreat corresponds to the true difference (if any) in efficacy between them, and the object of the trial is to obtain an unbiased estimate of this. While taking due account of the trial design, the method of selecting which of the eligible patients is to be included in the trial does not affect this, but the way in which those patients, who are recruited to the trial, are then allocated to which particular intervention does. Of fundamental importance here is the random allocation of subjects to the alternative treatments. Randomisation also provides a sound basis for testing the underlying null hypothesis, effectively testing if βTreat = 0 in Equation (2.1), by the use of a statistical test of significance.

5.3 Mechanics

5.3.1 Simple randomisation

5.3.1.1 Random numbers

The simplest randomisation device is a coin which if tossed will land with a particular face upward with probability one half. Repeated tossing generates a sequence of heads (H) and tails (T) such as HHTHH TTHTH. These can be converted to a binary sequence 00100 11010 by replacing H by 0 and T by 1. An alternative method would be to roll a six-sided dice, and allocate a 0 for faces 1, 2 and 3 and 1 to faces 4, 5 and 6.

To avoid using a coin or a dice for randomisation one can produce a computer-generated table of random numbers such as Table T1 based on the principle of throwing a 10-sided dice on many occasions with faces marked 0–9. Each digit is then equally likely to appear and cannot be predicted from any combination of other digits. The random numbers in Table T1 are grouped in columns of five digits merely for ease of reading. The table is used by first choosing a point of entry haphazardly, perhaps with a pin, but deciding in advance of this the direction of movement along the rows or down the columns. Suppose the pin chooses the entry in the 10th row and 13th column, and it had been decided to move along the rows, then the first 10 digits give the sequence 534 55425 67 (highlighted in bold in Table T1).

We cannot emphasise too strongly that the person producing such a randomisation list must conceal the list from the participating clinicians. Each allocation must only be disclosed after a patient has been recruited and registered into the trial. Otherwise, the whole merit of randomisation is lost because the next allocation is known (see Section 5.5).

5.3.1.2 Simple random allocation

The first step in the simplest form of randomisation for a parallel two-group comparative trial is to assign one intervention, say A, to even numbers (for this 0 is regarded as even) and the other intervention, B, to odd. The next step is to use the random numbers of Table T1 to generate the sequence of length N = 2m, where m is the planned number of recruits for each group. For example, using the previously chosen sequence 53455 42567 generates BBABB AABAB. Thus, the first 10 recruits involved will receive the interventions in this order, and once this is complete four individuals will have received A and six individuals B.
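The even/odd rule is easily expressed in code. The sketch below simply re-applies it to the ten digits quoted from Table T1; the function name `allocate_by_parity` is our own label and not part of any package.

```python
def allocate_by_parity(digits):
    """Map each random digit to an intervention: even (including 0) -> A, odd -> B."""
    return "".join("A" if int(d) % 2 == 0 else "B" for d in digits if d.isdigit())

sequence = allocate_by_parity("53455 42567")
print(sequence, sequence.count("A"), sequence.count("B"))
# BBABBAABAB 4 6
```

In practice the digits would come from a concealed computer-generated list rather than from a string typed into the program.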

Example 5.1 Simple randomisation – chronic gastro-oesophageal reflux

Csendes, Burdiles, Korn, et al. (2002) recruited 164 patients with chronic gastro-oesophageal reflux to their clinical trial using simple random allocation on a 1 : 1 basis to each treatment. This resulted in 76 randomised to receive fundoplication but 88 to calibration of the cardia. Thus, the achieved allocation ratio was 1 : 1.158, which is quite a disparity from that intended.
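To put the 76 : 88 split in context, one can ask how often simple 1 : 1 randomisation of 164 patients would give a division at least this uneven. The following sketch assumes each patient is allocated independently with probability one half, so that the number in one arm is Binomial(164, 0.5); the function name is hypothetical.

```python
from math import comb

def prob_split_at_least_as_uneven(n_total, smaller_group):
    """Two-sided probability of a split at least as uneven as the one observed.

    Assumes independent 1:1 allocation and smaller_group < n_total / 2.
    """
    lower_tail = sum(comb(n_total, k) for k in range(smaller_group + 1))
    # The Binomial(n_total, 1/2) distribution is symmetric, so double one tail
    return min(1.0, 2 * lower_tail / 2 ** n_total)

ratio = 88 / 76
p = prob_split_at_least_as_uneven(164, 76)
print(round(ratio, 3), round(p, 2))
```

Under these assumptions a disparity of this size is not a rare event, which is precisely why simple randomisation cannot be relied upon to deliver equal group sizes.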

5.3.2 Blocks

As we have just seen, simple randomisation will not guarantee equal numbers in the different intervention groups. To ensure equal numbers, balanced arrangements can be introduced. This leads to various ‘restricted randomisation’ schemes. One way to achieve balance is by generating short ‘blocks’ that contain the combinations of the interventions, as follows. The block size is taken as a convenient multiple of the number of interventions under investigation. For example, a two-group design with a 1 : 1 allocation may have block sizes 2, 4, 6 or 8, whereas for a three-group design blocks of size 3, 6 or 9 would be appropriate. In addition, the actual block size is often also chosen as a convenient divisor of the planned trial size. For example, if the sample size required is N = 64, and with two interventions planned, a block size of 4 or 8 would be preferable to one of 6, since 4 and 8 are divisors of 64 but 6 is not. Blocks are usually chosen as neither too small nor too large so that for two intervention groups block sizes of 4 or 6 are often used.

Suppose that equal numbers are to be allocated to A and B for successive blocks of four subjects. To do this, one first identifies amongst all the 16 possible combinations or permutations of A and B in blocks of 4, those that contain 2 A’s and 2 B’s. Here, we ignore those permutations with unequal allocation, such as AAAA and AAAB. The six acceptable permutations are summarised in Figure 5.1. These permutated blocks are then allocated the numbers 1–6, and the randomisation table used to generate a sequence of digits. Suppose this sequence was again 53455 42567, then reading the first six integers from left to right we generate for the first 24 recruits the allocation BAAB, ABBA, BABA, BAAB, BAAB and BABA. However, note that within this sequence BAAB is repeated three times, and BABA twice, while AABB, ABAB and BBAA are not used.
To avoid this imbalance of sequences, a particular digit for the sequences of Table T1 should not be used a second time until all other relevant individual digits have first been used. In this case, the sequence ignores the repeated digits and any outside the range of 1–6 and becomes, in effect, 534-- -2-6-. This generates BAAB, ABBA and BABA as previously but now followed by permutations 2 and 6 of Figure 5.1, which are ABAB and BBAA. Finally, we note that permutation 1 has not been used so that AABB automatically completes the full 24-unit allocation sequence. Such a blocking device ensures that for every four successive recruits included in the trial, balance between A and B is maintained. Thus, once we have recruited 24 individuals, 12 will receive A and 12 receive B. In the event that recruitment has to cease before all 24 units are allocated then, at whatever point this occurs, there will be approximately equal numbers in each intervention group.

1  AABB     4  BABA
2  ABAB     5  BAAB
3  ABBA     6  BBAA

Figure 5.1 All possible permutations of block size 4 for two treatments A and B each occurring only twice
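The procedure just described, using each permutation of Figure 5.1 once in the order in which its number first appears, can be sketched as follows. The block numbering is copied from Figure 5.1; the function names are our own, and placing any never-drawn permutations last in numerical order is an assumption that only matters if more than one remains.

```python
from itertools import product

# Numbering copied from Figure 5.1 (note it is not alphabetical: 4 = BABA, 5 = BAAB)
FIGURE_5_1 = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BABA", 5: "BAAB", 6: "BBAA"}

def balanced_blocks(size=4):
    """All arrangements of A and B of the given size with equal numbers of each."""
    return {"".join(p) for p in product("AB", repeat=size) if p.count("A") == size // 2}

def order_of_blocks(digits, labels):
    """Use each label once, in order of first appearance; append any never drawn."""
    order = []
    for d in digits:
        if d.isdigit() and int(d) in labels and int(d) not in order:
            order.append(int(d))
    return order + [k for k in sorted(labels) if k not in order]

first = order_of_blocks("53455 42567", FIGURE_5_1)
second = order_of_blocks("170 67937 88962 49992", FIGURE_5_1)
allocation = "".join(FIGURE_5_1[k] for k in first)
print(first, second, allocation)
```

The first call reproduces the order 5, 3, 4, 2, 6, 1 obtained above; the second uses the next digits of Table T1 for a further 24 recruits.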

If the design specifies 48 individuals are required to be randomised, then the sequences of Figure 5.1 are randomised again using the next digits from Table T1 which are as follows: 170 67937 88962 49992 etc. Working along this sequence results in the permutations 1, 6, 3, 2 and 4, and then 5 is taken to complete the set. In some circumstances, it may be desirable to avoid runs of the same treatment in successive patients, for example, there could be resource implications if different medical teams are responsible for the different treatments. Thus, if permutation 1 of Figure 5.1 is followed by permutation 6, then we have AABB and BBAA. This extended sequence of eight comprises a sub-sequence within it of four consecutive patients all assigned to B. To avoid this happening, if permutation 1 (AABB) is selected, then permutation 6 (BBAA) can be excluded as a possibility for the next block to be chosen so the reduced options are permutations 2, 3, 4 or 5. If then permutation 4 is chosen, the sequence is extended by BABA. At the next step, permutation 6 is returned to be one of the remaining choices which are now 2, 3, 5 and 6. As can be imagined, the randomisation process can become very tedious and so it is usual to use a computer program for this. In general, such programs should enable the particular options, for example, the number of interventions and the block size specified for the planned trial, to be accommodated.

5.3.2.1 Randomised block designs (RBD)

If Figure 5.1 is reformatted into that of Figure 5.2, then the so-called randomised block design (RBD) structure for the first 24 units randomised becomes more apparent. The contents of the blocks 1–6 are formed by randomising the six permutations as we have described. The basic structure of Figure 5.2 can be extended to fit the needs of any trial design. For example, if N = 48 then, as we have illustrated, the RBD is replicated a second time but with the permutations randomised again for this second usage. Equally, the basic structure can be adjusted to the numbers of interventions concerned. Thus, Figure 5.3 includes an RBD for t = 3 interventions A, B and C conducted in blocks of size b = 3, but over units of 18 patients. This design includes all six possible permutations of size 3. The eventual assignment of these permutations to the corresponding r = 6 replicate blocks will be made at random.

Block   Permutation   Allocation
1       5             B A A B
2       3             A B B A
3       4             B A B A
4       2             A B A B
5       6             B B A A
6       1             A A B B

Figure 5.2 Randomised block design (RBD) for 24 units comprising two interventions (A, B) in blocks of size 4

Block   Allocation
1       A B C
2       B C A
3       C A B
4       A C B
5       B A C
6       C B A

Figure 5.3 Randomised block design (RBD) for 18 patients comprising three interventions (A, B, C) in blocks of size 3

5.3.2.2 Variable block size

The investigating clinical team, especially those members concerned with identifying suitable patients, should not be aware of the block size. If they come to know, as each block of patients nears completion, guessing the next treatment to be allocated may again lead to subconscious inclusion or exclusion of certain patients from the trial. In fact, simple permuted blocks are rarely used, and a number of subterfuges can be applied to make the scheme less apparent to the clinical teams. Thus, the potential difficulty referred to can be avoided by changing the block size (at random) as recruitment continues to reduce the possibility of a pattern being detected by the investigation team. Clearly, this may not be an issue in double-blind trials where neither the attending physician nor patient knows of the actual treatment given to earlier patients, although it could become one if, for any reason, the blinding is not fully effective. This might happen, for example, if one particular (although blinded) treatment produces a specific reaction in the patient but this is not observed with the others. So even here a variable block size might be a wise precautionary strategy.

5.3.2.3 Allocation ratio

We have implicitly assumed that, for two interventions, a 1 : 1 randomisation will take place. However, the particular context may suggest other ratios. For example, if the number of patients available for the clinical trial is limited for whatever reason, the design team may then argue that they should obtain more information within the trial from (say) the test intervention group, T, rather than from the well-known control or standard, S. In such circumstances, a randomisation ratio of say 2 : 3 or 1 : 2 in favour of the test intervention may be chosen. The first could be realised by use of a dice with sides 1 and 2 allocated to S, 3, 4 and 5 to T and ignoring 6. The latter ratio could be obtained by also using a dice but with sides 1, 2 for S and now 3, 4, 5 and 6 for T. However, moving from a 1 : 1 ratio involves some increase in trial size (Chapter 9), and this increase should be quantified before a decision on the final allocation ratio is made.
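As a rough guide to the size of this increase, the sketch below uses a standard approximation rather than the methods of Chapter 9. It assumes a two-group comparison of means (or proportions) with the same variance in each arm, so that the variance of the estimated difference is proportional to 1/nS + 1/nT; the function name `size_inflation` is hypothetical.

```python
def size_inflation(ratio_s, ratio_t):
    """Total-size multiplier relative to 1:1 allocation for the same power.

    Assumes equal outcome variance in both arms, so the variance of the
    estimated difference is proportional to 1/n_S + 1/n_T.
    """
    r = ratio_t / ratio_s
    return (1 + r) ** 2 / (4 * r)

for s, t in [(1, 1), (2, 3), (1, 2)]:
    print(f"{s}:{t}", round(size_inflation(s, t), 3))
```

Under these assumptions a 2 : 3 ratio costs about 4% more patients in total and a 1 : 2 ratio 12.5% more, so modest departures from 1 : 1 are relatively inexpensive.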

1  SSTTT     6   TSTST
2  STSTT     7   TSTTS
3  STTST     8   TTSST
4  STTTS     9   TTSTS
5  TSSTT     0*  TTTSS

*Note 0 replaces 10 to facilitate the use of random number tables.

Figure 5.4 All possible permutations of block size 5 for two interventions S and T allocated in the ratio 2 : 3

If an allocation ratio of 2 : 3 for S and T is chosen, then this implies a minimum block size of 5 with each block comprising one of the permutations given in Figure 5.4. If the design for a trial comprised N = 25 subjects, then the first sequence of random numbers that we used of 534 55425 67 generates, using Figure 5.4, the five blocks with the following permutations TSSTT, STTST, STTTS, STSTT and TSTST. The next set of five blocks would then randomise the order of the unused permutations (0, 1, 7, 8, 9) of Figure 5.4 using (say) the sequence 170 67937 88962 49992 to obtain SSTTT, TSTTS, TTTSS and TTSTS, and finally the unused permutation 8, TTSST, would be added to complete the sequence.
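The same idea can be sketched for the 2 : 3 ratio. Listing the distinct arrangements of two S’s and three T’s in alphabetical order happens to reproduce the numbering of Figure 5.4 (with 0 standing for 10); the function names are our own and the labelling scheme only works when there are at most 10 arrangements.

```python
from itertools import permutations

def ratio_blocks(n_s=2, n_t=3):
    """Distinct arrangements of n_s S's and n_t T's, labelled 1..9 then 0."""
    arrangements = sorted(set("".join(p) for p in permutations("S" * n_s + "T" * n_t)))
    return {(i + 1) % 10: block for i, block in enumerate(arrangements)}

def draw_blocks(digits, table, n_blocks, allowed=None):
    """Select blocks by digit, using each allowed label at most once."""
    allowed = set(table) if allowed is None else set(allowed)
    chosen = []
    for d in digits:
        if d.isdigit() and int(d) in allowed and int(d) not in chosen:
            chosen.append(int(d))
        if len(chosen) == n_blocks:
            break
    return chosen

table = ratio_blocks()
first = draw_blocks("534 55425 67", table, 5)
rest = draw_blocks("170 67937 88962 49992", table, 5, allowed=set(table) - set(first))
print([table[k] for k in first])
print([table[k] for k in rest])
```

Every block contains two S’s and three T’s, so the 2 : 3 ratio is restored exactly after each fifth patient.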

Example 5.2 Unequal allocation – hepatocellular carcinoma

In a double-blind trial conducted by Chow, Tai, Tan, et al. (2002), three doses of tamoxifen (0 placebo, 75 and 150 mg/m²) were allocated in the ratio 2 : 1 : 2. Consequently, the patients were stratified by the 10 recruiting centres and randomised within each centre using a fixed block size of b = 5.

5.3.2.4 Stratified randomisation

As in the trial conducted by Poon, Goh, Kim, et al. (2006) of Example 2.1, trials often involve more than one clinical centre. In their case, there were single centres from India, Korea, Singapore and Thailand. In such multicentre circumstances, it is usually desirable to maintain the design-balance between treatments for each centre involved, whether 1 : 1 or in a different but pre-defined ratio. This implies producing a distinct randomisation list for each centre. In this way, the relative experience of treating patients with the trial options is shared by all centres concerned.

Example 5.3 Stratified randomisation by recruiting centre – uncomplicated falciparum malaria

In the report of the trial conducted by Zongo, Dorsey, Rouamba, et al. (2007), the authors state that:

Nurses were responsible for treatment allocation, on the basis of computer-generated randomisation list, which were stratified by the three clinic sites. These lists were provided by an off-site investigator who did not participate in administration of the trial.

A generic term, equivalent to ‘centres’ in the above example, is ‘strata’, and this general concept of strata can be extended to other situations. For example, in children with neuroblastoma, those with metastatic disease are known to do less well (are more likely to relapse following treatment) than those without. Consequently, if both metastatic and non-metastatic patients are recruited to a randomised trial then, although balance between treatments using randomised blocks may be maintained overall, there is no guarantee of balance within each of the distinct prognostic groups. This can be rectified by using metastatic disease status (absent or present) as strata and then randomising within each of these. In this way, balance is maintained within each metastatic group as well as for the trial as a whole. Further, if this trial is multicentre then for each centre involved the randomisation to treatment can be conducted within each metastatic disease group thereby ensuring the protocol-specified allocation ratio of the interventions is closely maintained within each of the strata and also within the centre.

This process can, at least in theory, be extended to more and more strata but this may not be sensible. The principal reason for stratifying by centre is to give each centre the same relative experience of all the interventions specified by the design. In contrast, the reason for stratifying by presence or absence of metastatic disease is because the disease status of the patient is a very important determinant of outcome. So should the balance of patient metastatic status be different in the two intervention groups, then this may distort the final treatment comparison. To give an extreme example, suppose by chance all the metastatic patients happened to receive the same treatment, say A, and none received B. Then, the true effect of A versus B will be compromised – in fact B may appear to have a clear advantage in terms of survival over A. However, had half the metastatic patients been included in B, and half in A, then group A would improve over the previous situation while B worsens so that the difference between those receiving A and B would be less extreme.
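Randomising within strata amounts to keeping a separate blocked list for every stratum. The following is a minimal sketch under stated assumptions: the centre names, the seed and the function names are hypothetical, and for brevity each block is drawn independently from the six permutations of Figure 5.1 rather than using every permutation once before any is repeated.

```python
import random
from itertools import product

BLOCKS = ["AABB", "ABAB", "ABBA", "BABA", "BAAB", "BBAA"]  # Figure 5.1

def stratum_list(n_blocks, rng):
    """One allocation list built from randomly chosen balanced blocks of four."""
    return "".join(rng.choice(BLOCKS) for _ in range(n_blocks))

def stratified_lists(centres, statuses, n_blocks, seed=2021):
    """A separate list for every centre-by-metastatic-status stratum."""
    rng = random.Random(seed)
    return {stratum: stratum_list(n_blocks, rng) for stratum in product(centres, statuses)}

lists = stratified_lists(["Centre 1", "Centre 2"], ["absent", "present"], n_blocks=3)
for stratum, allocations in lists.items():
    print(stratum, allocations)
```

Each new patient is allocated the next unused entry of the list for his or her own stratum, so balance is maintained within every centre and metastatic group as well as overall.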
In our example of neuroblastoma in children, there are other factors (apart from treatment itself and metastatic status) which affect outcome, such as stage of the disease and age at diagnosis. However, further stratification makes the randomisation process more difficult and as the potential number of strata becomes large, there is a distinct possibility of only a few patients falling into some of the strata subgroups. Thus, if four centres were involved, with strata for metastatic status with two levels, and (say) stage with three levels then this creates 4 × 2 × 3 = 24 substrata so that the number of patients per treatment within the substrata (assuming equal numbers in each level occur within strata although this is unlikely to be the case) is N/48, which is clearly less than an average of 10 patients unless the trial size is 480 or more. Thus, stratified randomisation as a method of achieving balance can become unworkable if there are too many stratification variables.

For continuous prognostic variables, such as age, stratification can only be done when these variables are divided into categories. There is then the difficult decision as to what cut-point of age to use. Thus, although age (or some other continuous variable) may be prognostic for outcome, it is usually preferable not to stratify for this but to record the information for each patient and take account of this in a retrospective sense at the analysis stage.

If there are several potential variables for stratification, these should all be variables that are known or are believed to be influential on outcome in a substantial way. The choice then is to decide which will be the design strata that are taken account of at the randomisation stage, and which can be viewed as retrospective strata (to be recorded at the patient eligibility stage and then adjusted for in the final analysis). Centre will usually be added to the former group while continuous variables are usually best left for the latter. In whichever category the variables fall, it is also important (at the planning stage) to judge if the subsequent analysis taking account of these will be sufficiently robust. In broad terms, are there enough subjects in the trial to ensure that the appropriate statistical models can be estimated with reasonable precision?

5.3.3 Dynamic allocation by minimisation

Stratification only works when the number of strata is small. As a consequence, a number of dynamic methods have been developed, of which the simplest is ‘minimisation’. Dynamic methods are used to replace the randomisation process with a largely deterministic method that is based on the characteristics of the patients already in the trial, and the characteristics of the patient about to be allocated to an intervention.

Example 5.4 Stratified randomisation – cranberry or apple juice for urinary symptoms

Campbell, Pickles and D’yachkova (2003) used a simple 1 : 1 randomisation, within each of four substrata, to assign treatment either by Cranberry (CJ) or Apple (AJ) juice to alleviate urinary symptoms during external beam radiation for prostate cancer. Their allocation into the 2 × 2 stratification groups of previous transurethral resection of the prostate (TURP) (Negative or Positive) and International Prostate Symptom Score (IPSS) […]

        |                                  |z|     (95% CI)
--------+------------------------------------------------------
AL      | 0.1020
AQ + SO | 0.0172
--------+------------------------------------------------------
Diff    | 0.0849   0.0211   3.89   0.0001   0.0435 to 0.1262
---------------------------------------------------------------

Figure 8.4 Edited commands and output for the analysis of parasite clearance. Source: Data from Zongo, Dorsey, Rouamba et al. (2007).

Example 8.8 NNT – cleft lip and palate, and oral lichen planus

In a randomised trial conducted by Williams, Seagle, Pegoraro-Krook, et al. (2011), children with upper cleft lip and palate received palatal surgery by either the von Langenbeck (vL) or the Furlow (F) technique. Satisfactory hypernasality was achieved in 0.7094 (144/203) of those children having vL and in 0.8150 (141/173) with F surgery. Thus, the estimated advantage to F is 0.1056 or 10.6% with a 95% CI ranging from a lower limit, LL = 0.0205 to an upper limit UL = 0.1908, p-value = 0.017. From which the NNT = 1/0.1056 = 9.5 or approximately 10. The corresponding 95% CI for the NNT ranges from 1/UL to 1/LL. Thus, from 1/0.1908 = 5.24 or 5 to 1/0.0205 = 48.71 or 49 (the latter calculated before LL was rounded).

In contrast, the small advantage of 0.0666 for patients with oral lichen planus by the use of S over C in Example 8.5 was not statistically significant. Also, the 95% CI included the null hypothesis of zero difference and ranged from −0.1006 to +0.2338. In this instance, although NNT = 1/0.0666 = 15.02 the CI cannot be obtained readily because of the negative LL as has been explained by Altman (2000, pp. 124–126). This suggests that the value of the NNT is best confined to situations where the 95% CI excludes the null hypothesis.
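The NNT calculation for the cleft lip and palate trial can be sketched as follows. The standard error used is the usual large-sample form for a difference in two proportions, which reproduces the limits quoted above; the function name `nnt_with_ci` is hypothetical, and no interval for the NNT is returned when the CI for the difference includes zero.

```python
from math import sqrt

def nnt_with_ci(events_1, n_1, events_2, n_2, z=1.96):
    """Difference in proportions (group 2 minus group 1), its CI, NNT and NNT limits."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    ll, ul = diff - z * se, diff + z * se
    if ll <= 0 <= ul:
        # CI includes zero: the reciprocal limits do not form a simple interval
        return diff, (ll, ul), 1 / diff, None
    return diff, (ll, ul), 1 / diff, (1 / ul, 1 / ll)

# vL: 144/203 satisfactory; F: 141/173 satisfactory
diff, ci, nnt, nnt_ci = nnt_with_ci(144, 203, 141, 173)
print(round(diff, 4), round(nnt, 1), nnt_ci)
```

Because the code works with unrounded proportions, the final decimal places may differ slightly from hand calculations based on the rounded values in the text.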

8.5.5 Ordered categorical

In certain situations, binary data, such as those we have analysed as a comparison of two proportions, may arise from an underlying categorical variable such as that summarised in Figure 8.5 in which response to treatment is assessed on a 6-point scale ranging from ‘Worse’ to ‘Complete resolution’. Thus, investigators, at the analysis stage, could define a satisfactory response as including ‘Complete resolution’ or ‘Striking improvement’ and then summarise these rates as 16/41 (39.0%) and 1/20 (5.0%). These proportions are then compared using the methods above. However, this is wasteful of the detailed information contained in the categories. In these calculations, those in the corresponding ‘not assessed’ groups are regarded as not having had an improvement. In practice, how such patients should be dealt with in the analysis would be specified in the trial protocol.

A more useful approach to the comparison of the treatment groups is to use the Ranksum test, which involves ranking the individual patients by first combining the two sets of data. Once again, however, a decision as to how the ‘not assessed’ is to be utilised in the analysis needs to be pre-specified, possibly by all being given a middle ranking and extending the categories to 7, or omitted from the calculations entirely.

Example 8.9 Ordered categorical data – eczema

Ignoring the six patients of Figure 8.5 who were not assessed, there remain nP = 19 patients in the P group and nA = 36 in A. Thus, a total of nP + nA = 55 patients are ranked. There is only one patient who had ‘Complete resolution’ and so has rank 1. In contrast, 16 had ‘Striking improvement’ so they share the ranks from 2 to 17 and are thus each assigned the average of these ranks (2 + 17)/2 = 9.5. Similarly, those 17 with ‘Moderate improvement’ are assigned (18 + 34)/2 = 26 and so on until the one patient who is ‘Worse’ has the last rank, 55. The ranks are now apportioned amongst the treatment groups within each category and summed to give the entries in the final two columns of Figure 8.5. Note that WA + WP = (36 + 19)(36 + 19 + 1)/2 = 1540 in this case. From these, and using A as the Either group, W0 = 36(36 + 19 + 1)/2 = 1008 and $SE_0 = \sqrt{36 \times 19 \times (36 + 19 + 1)/12} = \sqrt{3192} = 56.4978$. Then, from Equation (8.10) with WA = 862.0, z = (862.0 − 1008)/56.4978 = −2.58. Use of Table T2 gives the p-value = 2(1 − 0.995 06) = 0.009 88. This suggests a statistically significant advantage to A over P in these patients.

                                                                        Sum of ranks
Investigator assessed    Placebo   Azathioprine   All   Ranks       Placebo   Azathioprine
Complete resolution         0           1           1   1                          1
Striking improvement        1          15          16   2 to 17        9.5       142.5
Moderate improvement        8           9          17   18 to 34     208.0       234.0
Slight improvement          4           6          10   35 to 44     158.0       237.0
No change                   5           5          10   45 to 54     247.5       247.5
Worse                       1           0           1   55            55.0
Not assessed                1           5           6
Total                      20          41          61           WP = 678.0   WA = 862.0

Figure 8.5 Investigator assessed response to treatment in patients with moderate-to-severe eczema. Source: Data from Meggitt, Gray, and Reynolds (2006).
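The hand calculation of Example 8.9 can be checked with a short program. This sketch works directly from the category counts of Figure 8.5, omits the ‘Not assessed’ patients, makes no correction for ties, and obtains the two-sided p-value from the Normal distribution via the complementary error function; the variable and function names are our own.

```python
from math import sqrt, erfc

# Counts per category from Figure 8.5, best response first; 'Not assessed' omitted
placebo      = [0, 1, 8, 4, 5, 1]
azathioprine = [1, 15, 9, 6, 5, 0]

def midranks(*groups):
    """Average rank shared by all patients in each ordered category."""
    ranks, start = [], 1
    for total in (sum(cat) for cat in zip(*groups)):
        ranks.append(start + (total - 1) / 2)
        start += total
    return ranks

r = midranks(placebo, azathioprine)
w_p = sum(n * rank for n, rank in zip(placebo, r))
w_a = sum(n * rank for n, rank in zip(azathioprine, r))
n_p, n_a = sum(placebo), sum(azathioprine)

w0 = n_a * (n_a + n_p + 1) / 2                 # expected rank sum for A
se0 = sqrt(n_a * n_p * (n_a + n_p + 1) / 12)   # no correction for ties
z = (w_a - w0) / se0
p_value = erfc(abs(z) / sqrt(2))               # two-sided Normal p-value
print(r, w_p, w_a, round(z, 2), round(p_value, 4))
```

The p-value printed uses the unrounded z and so differs very slightly from the value read from Table T2 with z rounded to −2.58.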

The corresponding output for Example 8.9 from a statistical package is summarised in Figure 8.6. The statistical analysis command comprises the test to be performed (ranksum) followed by the endpoint variable (Grade) and the groups to be compared (Treat). The output differs from our calculations in that z = −2.674. This is because the Ranksum test of Equation (8.10) should be modified if there are many tied observations, which therefore share the same rank value. As a consequence, SE0 is modified to become √2981.73 = 54.6052 and so is reduced in size. This adjustment leads to a smaller p-value = 0.0075 but this has little implication for the interpretation of the results in this example. Nevertheless, this is the appropriate procedure to use and fortunately statistical packages make the necessary adjustment automatically.

An ordered categorical outcome arises if, for example, both eyes are affected by the condition of concern in a randomised controlled trial as (assuming the same treatment is given to each eye) the patient may respond to treatment in neither eye, one eye or both eyes. Thus, the possible pairs of response for the two eyes are (0,0), (1,0), (0,1) and (1,1). In this case, the outcome variable can be regarded as a three-level ordered categorical taking possible values 0, 1 or 2. Thus, each patient response now contributes a single observation. Comparisons between the randomised groups can then be made as in Figure 8.6.


Analysis command
ranksum Grade, by(Treat)

Output
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
-------------+---------------------------
       Treat |  Obs   Rank Sum   Expected
-------------+---------------------------
     Placebo |   19        678        532
Azathioprine |   36        862       1008
-------------+---------------------------
    Combined |   55       1540       1540
-------------+---------------------------
Unadjusted variance     3192.00
Adjustment for ties     -210.27
                      ----------
Adjusted variance       2981.73

Ho: Grade(Azathioprine) = Grade(Placebo)
z = -2.674
Prob > |z| = 0.0075

Figure 8.6 Edited commands and output for the analysis of the response to treatment in patients with moderate-to-severe eczema of Figure 8.5. Source: Data from Meggitt, Gray, and Reynolds (2006).
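The tie adjustment reported in Figure 8.6 can be checked by hand. The following is a minimal sketch in Python, not the package's own implementation: it assumes the usual tie-corrected variance for the rank-sum statistic and a two-sided p-value from the Normal distribution, and the function name ranksum_with_ties is purely illustrative. The counts are those of Figure 8.5 with the 'Not assessed' patients excluded.

```python
import math

# Patients per response category in Figure 8.5, ordered from best to worst
# and excluding 'Not assessed': (Placebo, Azathioprine).
counts = [(0, 1), (1, 15), (8, 9), (4, 6), (5, 5), (1, 0)]


def ranksum_with_ties(counts):
    """Rank-sum test for two groups observed in tied, ordered categories."""
    n_p = sum(p for p, _ in counts)
    n_a = sum(a for _, a in counts)
    n = n_p + n_a
    rank_sum_p = rank_sum_a = 0.0
    tie_term = 0
    first_rank = 1
    for p, a in counts:
        t = p + a                            # size of this tied block
        mid_rank = first_rank + (t - 1) / 2  # average rank shared by the block
        rank_sum_p += p * mid_rank
        rank_sum_a += a * mid_rank
        tie_term += t**3 - t
        first_rank += t
    expected_a = n_a * (n + 1) / 2
    var_unadjusted = n_p * n_a * (n + 1) / 12
    tie_adjustment = n_p * n_a * tie_term / (12 * n * (n - 1))
    var_adjusted = var_unadjusted - tie_adjustment
    z = (rank_sum_a - expected_a) / math.sqrt(var_adjusted)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Normal p-value
    return (rank_sum_p, rank_sum_a, var_unadjusted,
            tie_adjustment, var_adjusted, z, p_value)


results = ranksum_with_ties(counts)
print([round(value, 4) for value in results])
```

Under these assumptions the sketch should reproduce the rank sums 678 and 862, the adjustment for ties of 210.27, the adjusted variance of 2981.73 and z = −2.674 shown in Figure 8.6.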

This concept readily extends to situations where, in patients with multiple skin lesions, a fixed number of target lesions, k, are identified for monitoring purposes in all patients. Each target is then scored 0 (no response) or 1 (response) and hence individual patient total response values can range from 0 to k.

8.5.6 Time-to-event

Time-to-event data are characterised by observations that may be censored. Thus, for some subjects, in whom the endpoint ‘event’ of interest has occurred, the actual survival time, t, is observed. However, other subjects, for whom the ‘event’ has not yet occurred up to a point in their observation time, have censored survival times, T+. The analysis of these data, which involves one of t or T+ for every subject, will involve Kaplan–Meier (K–M) estimates of the corresponding survival curves.

Calculating the K–M survival curve, even with a small data set, is a tedious business. It has to be calculated using a large number of decimal places to avoid rounding error and is therefore prone to mistakes. The process first involves ranking the survival times of the group under consideration from smallest to largest. If in this ordered listing a t and a T+ take the same numerical value, that is, they are tied, then the censored observation is given the larger ranking. Alongside each rank, either a 1 or a 0 is placed depending on whether or not an event had occurred at that time. Essentially, the t will have a 1 attached and the T+ a 0.

The simplest situation is when there are no censored observations, as in Figure 8.7a, in which all 15 patients die at the indicated time t, which represents the overall survival in years (OSy). The K–M survival curve commences at t = 0 when 100% are alive and continues horizontally until t = 0.088 years, at which time one patient dies, so the survival curve drops by 1/15 and 14/15 (93.33%) remain alive. The curve then continues horizontally until t = 0.367 when 2 of these 14 patients die, thereby leaving 12/14 of them alive. At this time, 12/15 (80.00%) remain alive. This equals

$$\frac{14}{15} \times \frac{12}{14} = \frac{12}{15},$$

which is the product of the proportions still alive after the two death times at 0.088 and 0.367 years. This process continues with the death at t = 0.663 and the K–M estimate, denoted S(t), is now

$$S(0.663) = \frac{14}{15} \times \frac{12}{14} \times \frac{11}{12} = \frac{11}{15},$$

and so on until the final death at t = 7.159 when S(7.159) = 0. This process is summarised in the Output of Figure 8.7a whilst Figure 8.8a gives a step-down plot of the Survivor Function against Time, or S(t) against t.

Analysis commands

(a) No censored observations
stset OSy, fail(Dead)

(b) Three censored observations
stset OSy, fail(NewDead)

Output

(a) No censored observations
-------------------------------------
          Beg           Net  Survivor
Time    Total   Fail   lost  function
-------------------------------------
0.088      15      1      0    0.9333
0.367      14      2      0    0.8000
0.663      12      1      0    0.7333
0.942      11      1      0    0.6667
1.095      10      1      0    0.6000
1.103       9      1      0    0.5333
1.974       8      1      0    0.4667
1.982       7      1      0    0.4000
2.166       6      1      0    0.3333
2.182       5      1      0    0.2667
2.292       4      1      0    0.2000
2.382       3      1      0    0.1333
3.953       2      1      0    0.0667
7.159       1      1      0    0.0000
-------------------------------------

(b) Three censored observations
-------------------------------------
          Beg           Net  Survivor
Time    Total   Fail   lost  function
-------------------------------------
0.088      15      1      0    0.9333
0.367      14      2      0    0.8000
0.663      12      1      0    0.7333
0.942      11      1      0    0.6667
1.095+     10      0      1    0.6667
1.103       9      1      0    0.5926
1.974+      8      0      1    0.5926
1.982       7      1      0    0.5079
2.166       6      1      0    0.4233
2.182       5      1      0    0.3386
2.292       4      1      0    0.2540
2.382+      3      0      1    0.2540
3.953       2      1      0    0.1270
7.159       1      1      0    0.0000
-------------------------------------

Figure 8.7 Edited commands and output for the analysis for calculating the Kaplan–Meier (K–M) survival curves of 15 patients with (a) no censored and (b) with censored observations


[Figure 8.8 comprises two step-down plots, (a) No censored observations and (b) Three censored observations, each showing Overall survival (%) against Time from randomisation (years), with the numbers at risk listed beneath the time axis.]

Figure 8.8 Kaplan–Meier (K–M) overall survival curves in years of 15 patients with (a) no censored and (b) with three censored observations

The laborious process described for calculating a K–M survival curve is not essential when there are no censored observations, as the successive survival estimates reduce to the simple fractions 14/15 = 0.9333, 12/15 = 0.8000, 11/15 = 0.7333, and so on until 0/15 = 0.000, which can be expressed as percentages. However, this long-winded process is necessary when censored observations are present.

Once again, the K–M survival curve commences at t = 0 with 100% alive and continues as before until

$$S(0.942) = \frac{14}{15} \times \frac{12}{14} \times \frac{11}{12} \times \frac{10}{11} = \frac{10}{15} = 0.6667.$$

At this point, 10 patients remain alive. However, the next observation is censored at T+ = 1.095 years as no ‘event’ has occurred for this patient. So, beyond this time, there remain not 10 patients on follow-up and at risk of having an ‘event’ but only 9, and the next death at t = 1.103 leaves 8/9 of these alive. Thus,

$$S(1.103) = \frac{14}{15} \times \frac{12}{14} \times \frac{11}{12} \times \frac{10}{11} \times \frac{8}{9} = \frac{10}{15} \times \frac{8}{9} = 0.5926.$$

There is a further censored observation at T+ = 1.974 so that

$$S(1.982) = \frac{14}{15} \times \frac{12}{14} \times \frac{11}{12} \times \frac{10}{11} \times \frac{8}{9} \times \frac{6}{7} = 0.5926 \times 0.8571 = 0.5079,$$

and this process continues until the final event at t = 7.159 when S(7.159) = 0.000. The corresponding analysis is summarised in Figure 8.7b, where the Survivor function column begins to diverge from that of Figure 8.7a at the first censored observation, T+ = 1.095. The step-down plot of S(t) against t for the censored example is the K–M survival curve of Figure 8.8b.
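The product-limit calculation described above is easily automated. The following Python sketch is an illustration only, not the statistical package's routine; the function name kaplan_meier and the list names are hypothetical, whilst the survival times and censoring indicators are those of Figure 8.7.

```python
def kaplan_meier(times, events):
    """Return (time, at risk, deaths, censored, S(t)) for each distinct time.

    events holds 1 for a death and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Collect every observation recorded at this time.
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 1:
                deaths += 1
            else:
                censored += 1
            i += 1
        # Deaths at time t are counted whilst patients censored at t are
        # still regarded as at risk (the censored time ranks after the death).
        survival *= (at_risk - deaths) / at_risk
        rows.append((t, at_risk, deaths, censored, round(survival, 4)))
        at_risk -= deaths + censored
    return rows


times = [0.088, 0.367, 0.367, 0.663, 0.942, 1.095, 1.103, 1.974,
         1.982, 2.166, 2.182, 2.292, 2.382, 3.953, 7.159]
dead = [1] * 15                                              # Figure 8.7a
new_dead = [1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1]     # Figure 8.7b

for row in kaplan_meier(times, new_dead):
    print(row)
```

Calling kaplan_meier(times, dead) instead gives the uncensored estimates of Figure 8.7a, which reduce to the simple fractions 14/15, 12/15, 11/15 and so on.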


Note that in both the non-censored and censored situations, the final value is S(7.159) = 0.000. This corresponds to the observed survival time of the patient at t = 7.159, which is longer than any other, but who then had an ‘event’ at this (longest of all) time. Had this longest survivor not died in Figure 8.7b, then the corresponding observation would be T+ = 7.159 and S(7.159) = 0.1270 would retain the same value as the previous S(3.953).

In order to do the necessary calculations, the database must contain a row for each patient with two columns, one containing the individual’s survival time, whilst the second contains either 0 or 1 indicating a censored or non-censored observation respectively. Thus, in Figure 8.7, ‘OSy’ represents the overall survival time of these patients from randomisation expressed in years, whilst ‘Dead’ indicates whether the patient has died (1) or was still alive when last observed (0). The corresponding command stset OSy, fail(Dead) identifies the variable that is the ‘time-to-event’ measure and the variable that indicates whether or not the event has occurred, and the K–M estimates at each time of death are output as in Figure 8.7.

If we have two groups to compare then each will provide a K–M estimate of the corresponding survival curve. The summary statistic used is the hazard ratio (HR), which is the ratio of the risks of an event in the two groups concerned.

Example 8.10 Comparing groups – advanced neuroblastoma

The K–M estimates of the individual survival curves are shown in Figure 8.9, calculated from a selection of 50 children with advanced neuroblastoma randomised to receive either retinoic acid (R) or placebo (P) in a double-blind formulation. The selection of only 50 of the 176 children concerned is made to highlight features of the K–M graphics. Figure 8.9 shows the step down at each time of death, identifies the censored observations by marking the points where they occur, shows that the K–M estimate is unchanged following a censored observation, and shows how the number of children in each group declines (some will have died whilst others have censored observations) from randomisation (time 0) to just short of 12 years. It is worth noting that if further information on a censored observation becomes available (say in the Placebo group the patient censored at 5.04 years is then known to have died shortly after at 5.05), the whole of the updated K–M curve to the right will then drop down from 0.32 to 0.28 immediately before the next censored observation of 5.19 years. Similarly, if the last patient dies shortly after 11.29 years, the K–M estimate would drop from 0.19 to 0.

Calculating the HR, and comparing the two intervention groups, requires the logrank test. The logrank test is based on the same principles as the Ranksum test described earlier. However, the presence of censored observations again makes the calculations tedious and error prone. Consequently, we omit the details here although these can be found in, for example, Tai and Machin (2014, Chapter 6).


The test of the null hypothesis is a test of the equality of the event (here death) rates between the groups with respect to the survival endpoint concerned. This is expressed as a ratio by H0: HR0 = 1.

Analysis commands
stset OSy, fail(Dead)
sts graph, by(Treat)

[Plot of Overall survival (%) against Time from randomisation (years) for the Placebo and Retinoic acid groups, annotated HR = 0.99, 95% CI 0.68 to 1.44.]

At risk
Time (years)      0    2    4    6    8   10   12
Placebo          25   15    8    6    6    3    0
Retinoic acid    25   16   11    9    4    3    0

Figure 8.9 Kaplan–Meier (K–M) estimates of overall survival in 50 children with advanced neuroblastoma following good induction response subsequently randomised equally to receive placebo or retinoic acid. Source: Based on Kohler, Imeson, Ellershaw, et al. (2000).

Example 8.11 Logrank test – advanced neuroblastoma

Details of the necessary commands and output for obtaining the HR and associated logrank test on the full group of children with neuroblastoma randomised to receive R or P, given in a double-blind formulation, are shown in Figure 8.10. In addition to stset OSy, fail(Dead), the log-rank test command sts test Treat is added.

Analysis commands
stset OSy, fail(Dead)
sts test Treat

Output – Number of observations and failures
Failure event: Dead = 1, obs. time interval: (0, OSy]
---------------------------------------------
176 obs, 111 failures
---------------------------------------------

Output – log-rank test
---------------+-----------------------------
               |           Deaths     Events
Treat          |  Alive   observed   expected
---------------+-----------------------------
Placebo        |     32         56      55.78
Retinoic acid  |     33         55      55.22
---------------+-----------------------------
Total          |     65        111     111.00
---------------+-----------------------------
chi2(1) = 0.00, Pr > chi2 = 0.9669

Figure 8.10 Edited commands and output for the analysis using the logrank test to compare survival curves. Source: Based on Kohler, Imeson, Ellershaw, et al. (2000).

The HR is estimated by the ratio of two ratios. One is the ratio of the observed number of deaths to those expected in the P group (56/55.78 = 1.0039) and the other is the corresponding ratio (55/55.22 = 0.9960) for group R. Together these give HR = 0.9960/1.0039 = 0.99213, indicating a very marginal advantage to those receiving R. This is very close to the null hypothesis value, HR0 = 1, suggesting no significant difference between treatments. The calculation of the number of expected events assumes the null hypothesis is true and depends on the number of subjects within each group, the associated number of events, and the rank order of the individual survival times (censored and not censored) amongst all subjects.

As there are only two intervention groups, the familiar z-test can be obtained as the square root of chi2(1) in Figure 8.10. In this example, z = √0.00 = 0.00, but only because two decimal places have been provided in the output. This would give a p-value = 1 exactly! However, the computer holds more decimal places internally and provides (Pr>chi2 = 0.9669) or a p-value = 0.97.
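The ratio-of-ratios estimate of the HR is simple enough to verify directly. The following Python sketch uses the observed and expected deaths of Figure 8.10; the function name is hypothetical and chosen only for illustration.

```python
def hazard_ratio_from_logrank(obs_test, exp_test, obs_ref, exp_ref):
    """HR as the ratio of the observed/expected death ratios in two groups."""
    return (obs_test / exp_test) / (obs_ref / exp_ref)


# Retinoic acid (R) relative to placebo (P), values from Figure 8.10.
hr = hazard_ratio_from_logrank(55, 55.22, 56, 55.78)
print(round(hr, 4))
```

Because the sketch carries full precision rather than the rounded ratios 0.9960 and 1.0039, the result may differ from 0.99213 in the final decimal places.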

8.6 Regression methods

In Section 2.10 we introduced the linear model

$$y = \beta_0 + \beta_{Treat}\,x + \varepsilon, \tag{8.13}$$

where y represents the endpoint of interest in the clinical trial, and x = 0 and x = 1 the two interventions concerned, whilst β0 and βTreat are the regression coefficients to be estimated from the trial data. We also stated that ε represents the noise (or error), which is assumed to be random with a mean value of 0 across all subjects recruited to the trial and standard deviation (SD), σ. However, subtle changes to the form of this model may have to be made to accommodate different endpoint types. We discuss the changes necessary under the relevant sections below. We will not review the details of how the regression coefficients are estimated as this can be found in many statistical texts including Tai and Machin (2014). We also rely on computer packages for calculation purposes, as is good practice for the clinical trial team.

8.6.1 Comparing means

For this situation, the model of Equation (8.13) remains as it is with the proviso that the endpoint, y, can be regarded as a continuous variable with an approximately normal distribution within each of the (two) groups concerned. In Figure 8.11 of Example 8.12 the analysis command is regress, the endpoint variable concerned is SASSAD, and the two groups are defined by Treat. If we compare this output with that of Figure 8.2 then there are some familiar quantities, for example, here standard error (SE) = 2.3955, there 2.395528, and here t = 2.62, there t = 2.6202. In addition, from this output the estimated regression coefficients are b0 = 6.6500 and bTreat = +6.2768. This leads to the following model for our data

$$y = 6.6500 + 6.2768x \tag{8.14}$$

Also, as we noted in Section 2.10, the difference between treatments is estimated by bTreat, which we now compare with diff = 6.276829 in Figure 8.2. Thus, the two methods of analysis give identical results apart from some arithmetic rounding errors. We shall see that this regression approach to analysis gives greater flexibility in more complex situations.
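The equivalence between the regression coefficient bTreat and the difference between the two group means can be seen in a small numerical sketch. The Python code below fits the linear model by ordinary least squares; the five observations are invented purely for illustration (they are not trial data) and the function name is hypothetical.

```python
def fit_two_group_regression(y, x):
    """Least-squares estimates (b0, bTreat) for y = b0 + bTreat * x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b_treat = sxy / sxx
    b0 = mean_y - b_treat * mean_x
    return b0, b_treat


# Invented endpoint values: three patients with x = 0 and two with x = 1.
y = [4.0, 7.0, 10.0, 12.0, 15.0]
x = [0, 0, 0, 1, 1]

b0, b_treat = fit_two_group_regression(y, x)
mean_0 = sum(yi for yi, xi in zip(y, x) if xi == 0) / x.count(0)
mean_1 = sum(yi for yi, xi in zip(y, x) if xi == 1) / x.count(1)
print(b0, b_treat, mean_0, mean_1 - mean_0)
```

With a binary x, the fitted b0 is the mean of the x = 0 group and bTreat is the difference between the two group means, which is why the regression and two-sample comparisons agree.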

Example 8.12 Regression model – eczema

In order to compare the mean SASSAD reduction with Azathioprine and Placebo in patients with moderate-to-severe eczema, the regression command necessary to fit the linear model, together with the resulting output, is given in Figure 8.11.

Analysis command
regress SASSAD Treat

Output
------------------------------------------------------------
  Source |      SS     df       MS    Number of obs =     61
---------+-------------------------   F(1, 59)      =   6.87
   Model |   529.62     1   529.62    Prob > F      = 0.0112
Residual |  4551.33    59    77.14
---------+-------------------------
   Total |  5080.95    60
------------------------------------------------------------
  SASSAD |    Coef       SE       t    P>|t|      (95% CI)
---------+--------------------------------------------------
    cons |  6.6500
   Treat |  6.2768   2.3955    2.62    0.011   1.48 to 11.07
------------------------------------------------------------

Figure 8.11 Edited command and output for the analysis from a regression package used to compare the SASSAD score for eczema between two treatment groups. Source: Data from Meggitt, Gray, and Reynolds (2006).

8.6.2 Proportions and the odds ratio

It is often convenient, when comparing proportions in two groups, to express these in a relative way by quoting the odds ratio, OR. This is defined by

$$OR = \frac{\pi_T/(1-\pi_T)}{\pi_S/(1-\pi_S)} \quad \text{or} \quad \frac{\pi_T(1-\pi_S)}{\pi_S(1-\pi_T)} \tag{8.15}$$

The corresponding expressions for the SE and CI are quite complex but are given in Section 8.9 for completeness.

Example 8.13 Odds ratio – recurrent malaria

For the data of Example 8.7,

$$OR = \frac{p_{AL}/(1-p_{AL})}{p_{AQ+SP}/(1-p_{AQ+SP})} = \frac{0.1020/(1-0.1020)}{0.0172/(1-0.0172)} = 6.49.$$

The result is then expressed as indicating an increased risk with AL. The null hypothesis of equality of the proportions corresponds to OR = 1. Use of the expressions of Equations (8.24) and (8.25) can provide an approximate confidence interval, although we used a statistical package to obtain the 95% CI as 2.13–16.67, and a p-value = 0.0001. Both indicate a statistically significant result in favour of AQ + SP.

In this binary response situation, it is clear that the endpoint, y, can only take two values, 0 or 1, perhaps representing Present/Absent or Yes/No. Therefore, y cannot have the Normal distribution form, which has a possible range of values from −∞ to +∞. Equally, the proportions, pS and pT, observed with the feature of interest in each group must lie between 0 and 1 (equivalently 0–100%). One solution to this is to use, on the left-hand side of Equation (8.13), log[p/(1 − p)] instead of y itself; this transformation is termed logit(p). Suppose p = 0.01 then logit(0.01) = −4.60; if p = 0.5 then logit(0.5) = 0 and if p = 0.99, then logit(0.99) = +4.60. These values suggest a range of possible values on this logit scale from −∞ to +∞ with 0 in the middle. That being so, the linear model is then expressed as

$$\operatorname{logit}(p) = \beta_0 + \beta_{Treat}\,x + \varepsilon \tag{8.16}$$

The regression parameters of this model can be estimated using logistic regression.
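The behaviour of the logit transformation can be explored with a short Python sketch. The function names logit and inverse_logit are illustrative choices; the three probabilities are those quoted in the text.

```python
import math


def logit(p):
    """Log odds of a proportion p lying strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inverse_logit(value):
    """Map a value on the logit scale back to a proportion."""
    return 1 / (1 + math.exp(-value))


for p in (0.01, 0.5, 0.99):
    print(p, round(logit(p), 2), round(inverse_logit(logit(p)), 2))
```

The inverse transformation is what allows fitted values on the unbounded logit scale to be converted back to proportions between 0 and 1.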

Example 8.14 Logistic regression – oral lichen planus

In the randomised trial conducted by Poon, Goh, Kim, et al. (2006), one endpoint of concern was the clinical response at 4 weeks post-randomisation to C or S. The observed response rates were pS = 0.5211 and pC = 0.4545, giving an odds ratio,

$$OR = \frac{p_C(1-p_S)}{p_S(1-p_C)} = \frac{0.4545(1-0.5211)}{0.5211(1-0.4545)} = 0.7657,$$

indicating a greater response rate in favour of S. This is almost exactly reproduced in the lower section of Figure 8.12. The associated 95% CI is 0.39–1.50, which covers the null hypothesis value of OR0 = 1, and the p-value = 0.44.

The section of output that usually precedes this comes from the analysis command to fit the model (8.16). This gives estimates of the regression coefficients as b0 = 0.0846 and bTreat = −0.2669. From these, we have exp(−0.2669) = 0.7658, which equals the OR we had before. Thus, the analysis commands (logit Response Treat) and (logistic Response Treat) are two ways of carrying out the same analysis; one expressing the results in terms of regression coefficients, the other by use of the OR.

Basic tabulation
tabulate Treat Response
-------------+----------------------+---------
             |       Response       |
Treat        |     No        Yes    |   Total
-------------+----------------------+---------
Steroid      | (a) 34     (b) 37    |      71
Cyclosporine | (c) 36     (d) 30    |      66
-------------+----------------------+---------
Total        |     70         67    |     137
-------------+----------------------+---------

Analysis command
logit Response Treat

Output
Logistic regression
Number of obs = 137, LR chi2(1) = 0.61, Prob > chi2 = 0.4358
-----------------------------------------------------------------
Response  |    Coef.       SE       z    P>|z|       (95% CI)
----------+------------------------------------------------------
Treat     |  -0.2669   0.3429   -0.78    0.436   -0.9389 to 0.4051
Cons      |   0.0846   0.2376
-----------------------------------------------------------------

Analysis command
logistic Response Treat

Output
Logistic regression
Number of obs = 137, LR chi2(1) = 0.61, Prob > chi2 = 0.4358
-----------------------------------------------------------------
Response  | Odds ratio     SE       z    P>|z|       (95% CI)
----------+------------------------------------------------------
Treat     |   0.7658   0.2625   -0.78    0.436   0.3911 to 1.4995
-----------------------------------------------------------------

Figure 8.12 Edited commands and output from a regression package used for the logistic regression analysis of binary data from a trial in patients with oral lichen planus. Source: Data from Poon, Goh, Kim, et al. (2006).
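For a single binary covariate, the quantities in Figure 8.12 can be recovered directly from the four cells (a) to (d) of the basic tabulation. The Python sketch below assumes the familiar large-sample standard error of the log odds ratio (the square root of the sum of the reciprocal cell counts) and a Normal multiplier of 1.96; it is an illustration rather than the package's fitting algorithm, and the function name is hypothetical.

```python
import math


def odds_ratio_summary(no_ref, yes_ref, no_new, yes_new, z=1.96):
    """Odds ratio, log OR, SE(log OR) and approximate 95% CI from a 2 x 2 table."""
    log_or = math.log((yes_new / no_new) / (yes_ref / no_ref))
    se_log_or = math.sqrt(1 / no_ref + 1 / yes_ref + 1 / no_new + 1 / yes_new)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return math.exp(log_or), log_or, se_log_or, lower, upper


# Steroid (reference): 34 No, 37 Yes; Cyclosporine: 36 No, 30 Yes.
odds_ratio, log_or, se_log_or, lower, upper = odds_ratio_summary(34, 37, 36, 30)
b0 = math.log(37 / 34)  # log odds of response in the Steroid group
print(round(odds_ratio, 4), round(log_or, 4), round(se_log_or, 4),
      round(lower, 4), round(upper, 4), round(b0, 4))
```

Under these assumptions the log odds ratio and its SE correspond to the Treat row of the logit output, and exponentiating the limits gives the CI on the odds ratio scale.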


8.6.3 Ordered categorical

In the case of logistic regression, the endpoint has one of two values, 0 or 1. When we have ordered categorical data that can take one of J responses, it is convenient to label the response options 1, 2, … up to J and to think of this situation as cutting the underlying (continuous) response into segments. Then, we can develop a model that is based on a generalisation of logistic regression. There are a number of approaches that can be adopted. We describe one of the more common ones, which is known as the proportional odds model.

Equation (8.16) described the logistic model as logit(p) = β0 + βTreat x + ε. Since p is the probability that the outcome (Y) is positive (that is, equal to 1), we could instead have written P(Y = 1) as meaning the probability that Y = 1, giving: logit{P(Y = 1)} = β0 + βTreat x + ε. The proportional odds model extends this for the J possible outcomes, giving:

$$\operatorname{logit}\{P(Y \le j)\} = \alpha_j + \beta_{Treat}\,x + \varepsilon, \quad \text{for } j = 1, 2, \ldots, J-1 \tag{8.17}$$

The αj values represent a set of constants for the J − 1 cut-points. Thus, the model is effectively analysing each of the J − 1 ways that the responses can be combined, ≤1, ≤2, ≤3, …, ≤J − 1. Since βTreat does not have a subscript j, it will have the same value for all of the J − 1 thresholds. Thus, the odds ratio, OR = exp(βTreat), is assumed to be the same no matter how we collapse the categories, hence the term ‘proportional odds’. When fitting model (8.17) using a computer package, the software is effectively solving the equation for all possible J − 1 thresholds simultaneously, so as to obtain the single best estimate, bTreat, for the parameter βTreat.

Apart from the fact that there are now J − 1 cut-points instead of the single binary one, the computer output from ordered logistic regression is very similar to that of (simple) logistic regression and can be displayed either as coefficients or odds ratios. In fact, ordered logistic regression can be applied to the binary data of Figure 8.12, in which case identical results would be obtained. It is, however, more complex to use inasmuch as when there are several response options it becomes important to scrutinise the validity of the assumptions and, in particular, whether the odds ratio really does appear to be constant across all of the thresholds. If this is not the case, other models should be explored.

8.6.4 Time-to-event

Although we do not go into full details in the case of time-to-event data, the basic regression model is modified to

$$\log(HR) = \beta_{Treat}\,x + \varepsilon \tag{8.18}$$

This is termed the Cox proportional hazards regression model, more details of which can be found in Tai and Machin (2014).

Example 8.15 Cox model – advanced neuroblastoma

Using the survival times of the children with advanced neuroblastoma of Figure 8.10, the Cox model of Equation (8.18) is fitted using the command (stcox Treat, nohr) of Figure 8.13. The output leads directly to log(HR) = −0.00788x from which, with x = 1, the HR = exp(−0.00788) = 0.99215 which, apart from a rounding error on the last digit, is precisely that obtained using the logrank approach of Example 8.11. Alternatively, the HR can be obtained directly using the same command but omitting (, nohr).

Analysis commands
stcox Treat, nohr
stcox Treat

Edited Output
failure _d: Dead, analysis time _t: OSy
No. of subjects = 176, No. of failures = 111
Total time at risk = 748.2573582
-------------------------------------------------------------
_t     |     Coef       SE        z    P>|z|      (95% CI)
-------+-----------------------------------------------------
Treat  | -0.00788   0.1900   -0.041     0.97   -0.380 to 0.364
-------------------------------------------------------------
_t     |       HR       SE        z    P>|z|      (95% CI)
-------+-----------------------------------------------------
Treat  |  0.99215   0.1885   -0.041     0.97     0.68 to 1.44
-------------------------------------------------------------

Figure 8.13 Edited commands and output from a regression package using the Cox model to compare survival curves. Source: Data from Kohler, Imeson, Ellershaw, et al. (2000).
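The relationship between the two blocks of output in Figure 8.13 can be checked numerically. The Python sketch below exponentiates the coefficient and its confidence limits; it assumes a Normal multiplier of 1.96 and that the SE quoted for the HR is the usual approximation HR × SE(coefficient). The function name is hypothetical.

```python
import math


def hr_from_coefficient(coef, se, z=1.96):
    """Hazard ratio, its approximate SE and 95% CI from a Cox coefficient."""
    hr = math.exp(coef)
    se_hr = hr * se                      # approximate SE on the HR scale
    lower = math.exp(coef - z * se)
    upper = math.exp(coef + z * se)
    return hr, se_hr, lower, upper


hr, se_hr, lower, upper = hr_from_coefficient(-0.00788, 0.1900)
print(round(hr, 5), round(se_hr, 4), round(lower, 2), round(upper, 2))
```

The same function applies to the logistic regression coefficients of Section 8.6.2, where exponentiation gives an odds ratio rather than a hazard ratio.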

8.6.5 Adjusting for baseline

Although these results are no different from those we had earlier, the advantage of the Cox model, and of any other regression model, is that it can be extended to include baseline covariates by suitably adapting Equation (8.18). In the case of a single baseline covariate, c, with corresponding parameter, γ, the model becomes:

$$\log(HR) = \beta_{Treat}\,x + \gamma c + \varepsilon \tag{8.19}$$

For example, suppose that age, recorded on the date of randomisation, is known to be prognostic for outcome. Then, the investigators will need to check whether the treatment comparison, in this case the estimate of βTreat, is modified by taking the baseline variable age into account.

Example 8.16 Cox model with covariate – advanced neuroblastoma

To take account of the children’s age when making the comparison of the treatments P and R, the continuous variable (Agey) is added to the regression model of Figure 8.13 using the command (stcox Treat Agey, nohr). As shown in Figure 8.14, this estimates the corresponding model by log(HR) = 0.00723Treat + 0.05644Agey. Thus bTreat = +0.0072 rather than the −0.0079 obtained from the model with no covariate included. Numerically there is little change in the corresponding regression coefficient for treatment, although a very marginal advantage to R over P has now turned into a very marginal disadvantage. However, the corresponding p-value = 0.97 indicates that there is little evidence against the null hypothesis of no difference between treatments. The corresponding HR is obtained directly using (stcox Treat Agey).

Thus, although age appears to be prognostic in these patients (p-value = 0.024), taking account of age in an adjusted analysis changes the estimate of the HR only marginally from the unadjusted value of 0.9922 of Figure 8.13 to 1.0073. Nevertheless, care needs to be taken in ensuring this is truly the case. Thus, a more detailed scrutiny of these data is required before a firm view of how survival is influenced by age can be determined. We are using this as a statistical example, not as a definitive indication.

Analysis commands
stcox Treat Agey, nohr
stcox Treat Agey

Output – giving regression coefficients
failure _d: Dead, analysis time _t: OSy
Cox regression
No. of subjects = 176, No. of failures = 111
Time at risk = 748.2573582
--------------------------------------------------------------
_t     |    Coef.        SE      z    P>|z|       (95% CI)
-------+------------------------------------------------------
Treat  | +0.00723   0.19054   0.04    0.970   -0.3662 to 0.3807
Agey   |  0.05644   0.02493   2.26    0.024    0.0076 to 0.1053
--------------------------------------------------------------

Output – giving HR
--------------------------------------------------------------
_t     |       HR        SE      z    P>|z|       (95% CI)
-------+------------------------------------------------------
Treat  |   1.0073    0.1919   0.04    0.970    0.6933 to 1.4633
Agey   |   1.0581    0.0264   2.26    0.024    1.0076 to 1.1111
--------------------------------------------------------------

Figure 8.14 Edited commands and output from a statistical package using the Cox model to compare survival in two treatment groups taking into account patient age at diagnosis which is known to be strongly predictive of outcome. Source: Data from Kohler, Imeson, Ellershaw, et al. (2000).

In the context of a clinical trial, we are not directly interested in the influence of a covariate, but only in whether knowledge of this covariate influences to any degree our view concerning the value of βTreat. Thus, age in Example 8.16 constitutes what is known as a ‘nuisance covariate’. As we have noted, in that example, age appears to be of strong prognostic importance but does not change our estimate of βTreat materially.

The above example provides us with the rationale for using regression-based methods for the analysis of clinical trials. Seldom will a univariate comparison, whether of means, proportions or survival rates, be sufficient for the eventual analysis of any randomised trial. Consequently, most trials will require some adjustment to the simple comparison, perhaps necessitated by using a stratified design or when (as is often the case) known prognostic factors (assessed prior to randomisation) are present. In theory, the simplest model of Equation (8.13), suitably adapted to the type of endpoint variable of concern, can be extended to

$$y = \beta_0 + \beta_{Treat}\,x + \gamma_1 c_1 + \gamma_2 c_2 + \varepsilon \tag{8.20}$$

However, the main focus still remains on estimating βTreat, which measures the effect of the intervention. Good practice should confine the c variables (the covariates) to, at most, one or two key (and known) major influencing prognostic factors. Thus, although most trials will record many features of the patients recruited, for example, gender, age, ethnicity and so on, these variables are mostly chosen for descriptive purposes and are not anticipated to have any major impact on prognosis. Thus, they play no role in the analysis and are not candidates for inclusion in Equation (8.20). Assmann, Pocock, Enos and Kasten (2000) discuss the use and misuse of baseline data in the context of clinical trials.

8.6.6 Dummy variables

In general, prognostic factors may be of continuous, binary or categorical form. However, if they are of the latter type, care needs to be taken in how they are dealt with in a regression model. If the data are ordered categorical then the rank order of the categories, say, 1, 2, …, k, can be used in the regression model in the same way as a continuous variable. The only proviso is to ask: How reasonable is it that the steps between successive categories are equal? On the other hand, if the variable concerned is an unordered categorical variable, say, the patients are classified as pathology Type A, B or C (with no ordering indicated by the A, B, C), then to include such a variable in the model so-called dummy variables have to be created as in Figure 8.15.

Dummy variables are created as follows:
If a patient is Type A then v1 = 1, if not then v1 = 0.
If a patient is Type B then v2 = 1, if not then v2 = 0.
In this way a patient who is Type A is described by the pair (v1 = 1, v2 = 0), and a patient who is Type B is described by the pair (v1 = 0, v2 = 1).
Any patient who is Type C (he/she is not A and not B) will therefore be described by the pair (v1 = 0, v2 = 0).

Figure 8.15 Creating the dummy variables for an unordered categorical variable with three categories A, B and C
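The coding rules of Figure 8.15 translate directly into code. The following Python sketch is one possible way of writing them; the function name dummy_code and the use of the letters "A", "B" and "C" as category labels are assumptions made for illustration.

```python
def dummy_code(pathology_type):
    """Return the pair (v1, v2) for an unordered category A, B or C."""
    if pathology_type not in ("A", "B", "C"):
        raise ValueError("pathology type must be 'A', 'B' or 'C'")
    v1 = 1 if pathology_type == "A" else 0
    v2 = 1 if pathology_type == "B" else 0
    return v1, v2   # Type C is the reference category (0, 0)


for pathology_type in ("A", "B", "C"):
    print(pathology_type, dummy_code(pathology_type))
```

Only two dummy variables are needed for three categories because the third category is identified by both being zero; in general, k unordered categories require k − 1 dummy variables.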

In this way, a single variable of an unordered categorical nature, such as Pathology Type, is included in a covariate model very similar to (8.20), which becomes

$$y = \beta_0 + \beta_{Treat}\,x + \gamma_1 v_1 + \gamma_2 v_2 + \varepsilon \tag{8.21}$$

The null hypothesis that Pathology Type does not influence outcome is then expressed by H0: γ1 = γ2 = 0.

8.6.7 Time-varying covariates

In some circumstances, the possibility of an event occurring post-randomisation may influence the outcome in some way. For example, in a trial in cancer patients scheduled to receive one of two chemotherapy regimens, emergency radiotherapy may have to be initiated for some patients. The fact that radiotherapy is given may impact on patient survival time from randomisation and thereby influence the comparison of the two chemotherapy regimens under study. Here, the covariate (radiotherapy given: Yes or No), with the corresponding interval post-randomisation (to when radiotherapy is initiated) recorded, is termed a time-varying covariate. Although we give no details here, the methods for analysis are described in Tai and Machin (2014) and elsewhere.

8.6.8 Adjusting for strata

As we discussed in Chapter 5, randomisation in a clinical trial may be stratified if there is a feature of the patient or the disease condition which has a major bearing on the outcome irrespective of the particular intervention that will be allocated. In these circumstances, the randomisation is organised such that S and T interventions are proportionally represented within each stratum in the ratio set by the design, most often 1 : 1. At the analysis stage, a comparison is first made between S and T within each stratum, perhaps a difference in means (say, dStratum). These are then combined over the strata, taking due account of the differing patient numbers in each stratum, to provide a weighted mean, dStrata.

Example 8.17 Stratified by stage – advanced neuroblastoma

The randomisation in the trial of children with neuroblastoma was stratified by two disease stages, and this fact needs to be taken into account in the analysis. This can be achieved by expanding the command for the logrank test of Figure 8.10 to (sts test Treat, strata(Stage)). This analysis is given in Figure 8.16a and provides the strata-adjusted expected numbers of deaths, which we compare with the corresponding highlighted values obtained when not taking the strata into account. The adjusted hazard ratio is

$$ HR_{Adjusted} = \frac{55/55.11}{56/55.89} = 0.9960. $$

Alternatively, the stratified analysis can be undertaken using Cox regression as shown in Figure 8.16b. This gives HRAdjusted = 0.9959, p-value = 0.98 and, additionally, the 95% CI 0.68–1.45. In this example, there is no material difference between the unstratified and stratified analyses, so the interpretation remains unaltered.

(a) Stratified log-rank test

Command
    sts test Treat, strata(Stage)

Edited Output
    Stratified log-rank test for equality of survivor functions
    ---------------+------------------------------------
                   |  Deaths     Deaths        Deaths from
    Treat          |  observed   expected(*)   Figure 8.10
    ---------------+------------------------------------
    Placebo        |     56        55.89         (55.78)
    Retinoic acid  |     55        55.11         (55.22)
    ---------------+------------------------------------
    Total          |    111       111.00        (111.00)
    ---------------+------------------------------------
    (*) sum over calculations within each stage
    chi2(1) = 0.00, Pr > chi2 = 0.98

(b) Stratified Cox model

Command
    stcox Treat, strata(Stage)

Edited Output
    Stratified Cox regression.
    Number of obs = 176, No. of failures = 111, Time at risk = 748.26
    -------+----------------------------------------
           |   HR       z     P>|z|    (95% CI)
    -------+----------------------------------------
    Treat  | 0.9959   -0.02   0.98    0.68 to 1.45
    -------+----------------------------------------

Figure 8.16 Edited commands and output from (a) a stratified logrank test and (b) a stratified Cox model, to compare survival curves. Source: Based on Kohler, Imeson, Ellershaw, et al. (2000).

In a similar way to the stratified Cox model for time-to-event outcomes, stratified analyses can be performed for continuous and binary data using linear regression and logistic regression, respectively.

8.7

Other issues

8.7.1 Missing values

If information is missing on an endpoint variable, then the trial size is effectively reduced so that, at the very least, the precision of the final estimate of the difference measure between interventions will be reduced. If the number missing is considerable, then there must be some concern as to whether the comparisons between groups may be biased in some way. In certain situations, ‘missing’ values may be anticipated and are not ‘missing’ in the conventional sense of the relevant information being ‘absent’.

Example 8.18 Missing data – eczema

In assessing the clinical response following treatment for moderate-to-severe eczema of Example 8.9 and summarised in Figure 8.5, there may have been good clinical reasons why six patients were ‘Not assessed’. However, if the possibility of these arising was anticipated at the planning stage of the trial, then suitable action for their occurrence should have been pre-specified. Thus, a statement in the protocol might indicate that any ‘Not assessed’ patients will be regarded as ‘Worse’ (or ‘Ignored’) for trial summary purposes. Figure 8.17 summarises the results obtained if the ‘missing’ cases are ‘ignored’ or regarded as ‘worse’ in the example of Figure 8.5, but uses the command (nptrend), rather than the (ranksum) of Figure 8.3, to compare the two groups of ordered categorical data.

Analysis command: Missing values omitted
    nptrend Grade, by(Treat)

Edited output
    -----------------------------------------------
    Treat          Score    Obs    Sum of ranks
    -----------------------------------------------
    Azathioprine     0       36       1154
    Placebo          1       19        386
    -----------------------------------------------
    z = -2.67, Prob > |z| = 0.008

Analysis command: Missing values classed as Worse
    nptrend NewGrade, by(Treat)

Edited output
    -----------------------------------------------
    Treat          Score    Obs    Sum of ranks
    -----------------------------------------------
    Azathioprine     0       41       1390
    Placebo          1       20        501
    -----------------------------------------------
    z = -1.88, Prob > |z| = 0.061

Figure 8.17 Further analysis of the trial data of Figure 8.5 for SASSAD reduction in patients with eczema. Source: Data from Meggitt, Gray, and Reynolds (2006).

This example illustrates a critical situation in that one analysis suggests a statistically significant advantage of Azathioprine over Placebo (p-value = 0.008), whereas the other analysis suggests the difference is of marginal statistical significance (p-value = 0.061). This is one reason why Good Clinical Practice insists that the statistical analysis plan in the protocol must specify in advance how such cases are to be handled, so that ambiguities of this kind do not arise.

8.7.2 Graphical methods

In many situations summary statistics, albeit very useful, do not necessarily convey all the salient features of the trial results. For example, although Figure 1.1 indicates an advantage to Azathioprine in patients with moderate-to-severe eczema, there is considerable variation about the respective mean values of 12.93 and 6.65%. Thus, even in the Placebo group, the majority of patients appear to benefit from the treatment they received. Further, there are also many patients receiving Azathioprine who do not appear to benefit greatly. Thus, the graphs in the corresponding report by Meggitt, Gray and Reynolds (2006, Figure 2) provide the reader with a very useful view of what is going on. Many tips to assist with the graphical presentation of data are provided by Freeman, Walters and Campbell (2008).

8.7.3 Multiple endpoints

We have based the above discussion on the assumption that there is a single identifiable endpoint or outcome, upon which treatment comparisons are based. However, often there is more than one endpoint of interest within the same trial. For example, a trial could be assessing wound healing time, pain levels and MRSA infection rates. If one of these endpoints is regarded as more important than the others, it can be named as the primary endpoint of the trial and any statistical analysis focuses on that and that alone. A problem arises when there are several outcome measures which are all regarded as equally important. A commonly adopted approach is to analyse each endpoint separately from every other. Unfortunately, the multiple significance tests, and associated CIs, are all calculated using the same patients, albeit on a distinct outcome variable for each. It is well recognised that this distorts the p-values. Often a smaller p-value will be considered necessary for statistical significance to compensate for this. An equivalent strategy is to multiply the p-values obtained by the number of endpoints analysed, say k, and hence only declare those comparisons as statistically significant if the revised value is less than (say) the conventional 0.05.
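The adjustment just described, multiplying each p-value by the number of endpoints k, can be sketched as follows. The p-values are hypothetical, and capping the revised value at 1 is a common convention that we assume here rather than something stated in the text.

```python
def adjust_for_multiplicity(p_values, alpha=0.05):
    """Multiply each p-value by k, the number of endpoints analysed.

    Returns the revised p-values (capped at 1, an assumed convention) and
    whether each comparison would be declared statistically significant.
    """
    k = len(p_values)
    revised = [min(1.0, k * p) for p in p_values]
    significant = [p_rev < alpha for p_rev in revised]
    return revised, significant

# Hypothetical p-values for wound healing time, pain level and MRSA infection
p_values = [0.012, 0.030, 0.200]
revised, significant = adjust_for_multiplicity(p_values)
for p, p_rev, sig in zip(p_values, revised, significant):
    print(p, round(p_rev, 3), sig)
```

In this illustration only the first endpoint remains statistically significant after adjustment, although two of the three unadjusted p-values are below 0.05.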

However, there is no entirely satisfactory solution to this problem, so we would reiterate the importance of identifying only one, or two at the most, primary endpoints and confining hypothesis testing just to these. Any analyses conducted on other endpoints should be regarded more as hypothesis-generating comparisons than as a basis for drawing definitive conclusions about treatment differences.

8.7.4 Interim analyses

In some circumstances, during the recruitment stage of a clinical trial investigators may wish to examine the data as they accumulate and perform an interim analysis to obtain a preliminary view of the possible outcome from the trial. Usually, an interim analysis takes the form of that planned for the primary endpoint at the end of the trial. This analysis may concern a continuous, binary, ordered categorical or time-to-event variable and so would be similar to one or other of the examples we have described in this chapter. Thus, no new principles are concerned. However, the trial design team have to decide on the frequency of such interim analyses and also be aware that multiple looks at (ever-accumulating) data on the same endpoint variable raise the same issues as those concerned with multiple endpoints. The requirement for such interim analyses may arise if an independent data and safety monitoring board (DSMB) is established to oversee the trial on behalf of the Trial Steering Committee (TSC). In Chapter 10, we describe the function of a DSMB and give more details concerning interim analyses and their role in monitoring the accumulating trial data with a view to anticipating circumstances where stopping recruitment prematurely to a trial may be considered.

8.8 Practice

As should be fully appreciated, clinical trials are a major undertaking, and the precious data that have been accumulated deserve a careful and rigorous analysis, the main framework of which, as we have underlined, is set out within the protocol. This analysis demands the full use of the statistical team and the latest analytical techniques, including graphics routines, using well-validated and appropriate computer software. Neither home-grown analysis packages nor hand calculations should be used, even for elementary comparisons. It is also important that the data are fully explored, although checking for inconsistencies and out-of-range problems should have been addressed as the trial progressed, starting with the information from the first recruit. Many trials take a long time to complete, so that analysis plans set down in the protocol may not be the most appropriate when the time comes for analysis and synthesis. Although the trial team are obliged to analyse and report in the way prescribed, this should not prevent them from making full use of any new (statistical and software) developments that have arisen in the interim. Any such supplementary analysis should be reported as such and the implications of any (new) interpretations highlighted. The statistical team must be involved in all stages of the trial, from beginning to end; it is poor practice just to pass the final ‘parcel’ of data to an outside statistical team who are not fully integrated into the process.

8.9

Technical details

8.9.1 Recommended method for comparing proportions

Although we have given the components of the CI for a difference in proportions using the general formulation of Equation (8.3) and the specific SE of Equation (8.11), such an expression strictly only applies to large samples. For all situations, large or small, Newcombe and Altman (2000, pp. 46–49) describe a ‘recommended’ method for calculating a 100(1 − α)% CI for a single proportion, which has lower limit (LL) and upper limit (UL) defined by

$$ LL = \frac{A - B}{C} \quad \text{to} \quad UL = \frac{A + B}{C}, \tag{8.22} $$

where $A = 2np + z_{1-\alpha/2}^2$; $B = z_{1-\alpha/2}\sqrt{z_{1-\alpha/2}^2 + 4np(1-p)}$; $C = 2(n + z_{1-\alpha/2}^2)$, n is the number of subjects in the group and p the observed proportion of responses.

In the case of two intervention groups of size n1 and n2 with corresponding estimated proportions of p1 and p2, the estimated difference is d = p1 − p2. To calculate the CI, Equation (8.22) is evaluated for each group, to obtain LL1, UL1, and LL2, UL2. The 100(1 − α)% CI for the true difference in proportions, δ, is then

$$ p_1 - p_2 - \sqrt{(p_1 - LL_1)^2 + (UL_2 - p_2)^2} \quad \text{to} \quad p_1 - p_2 + \sqrt{(p_2 - LL_2)^2 + (UL_1 - p_1)^2}. \tag{8.23} $$

Note that the difference, d, is not generally at the midpoint of this interval.

8.9.2 Confidence interval for an odds ratio

The odds ratio can be calculated directly from the four cells of the 2 by 2 table such as those in Figure 8.12. Thus

$$ OR = \frac{ad}{bc}. \tag{8.24} $$

The estimated 100(1 − α)% CI for the true OR is given by

$$ \exp\left[\log OR - z_{1-\alpha/2}\,SE(\log OR)\right] \quad \text{to} \quad \exp\left[\log OR + z_{1-\alpha/2}\,SE(\log OR)\right], \tag{8.25} $$

where

$$ SE(\log OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}. \tag{8.26} $$

However, (8.26) is only useful if the components a, b, c and d are relatively large.
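Equations (8.22) to (8.26) are straightforward to program, and a minimal Python sketch using only the standard library is given below. The function names and the counts used to exercise them are hypothetical and are not taken from any trial in this book; the sketch is not a substitute for validated statistical software.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def proportion_limits(r, n, alpha=0.05):
    """Equation (8.22): LL and UL for a single proportion p = r/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = r / n
    a = 2 * n * p + z ** 2
    b = z * sqrt(z ** 2 + 4 * n * p * (1 - p))
    c = 2 * (n + z ** 2)
    return (a - b) / c, (a + b) / c

def difference_ci(r1, n1, r2, n2, alpha=0.05):
    """Equation (8.23): CI for the difference in proportions p1 - p2."""
    p1, p2 = r1 / n1, r2 / n2
    ll1, ul1 = proportion_limits(r1, n1, alpha)
    ll2, ul2 = proportion_limits(r2, n2, alpha)
    d = p1 - p2
    lower = d - sqrt((p1 - ll1) ** 2 + (ul2 - p2) ** 2)
    upper = d + sqrt((p2 - ll2) ** 2 + (ul1 - p1) ** 2)
    return d, lower, upper

def odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Equations (8.24) to (8.26): OR = ad/(bc) with its CI (large cells)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical data: 15/50 responses in one group and 5/50 in the other
print(difference_ci(15, 50, 5, 50))
# Hypothetical 2 by 2 table with cells a, b, c, d
print(odds_ratio_ci(10, 20, 20, 10))
```

Because the CI for the OR is constructed on the logarithmic scale, the estimate is the geometric rather than the arithmetic midpoint of its interval.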

CHAPTER 9

Trial Size

This chapter outlines in general terms the basic components required for trial size determination. The calculation of the number of patients to recruit to a trial depends on several factors, an important one of which is the type of endpoint concerned (for example continuous, binary or time-to-event) and hence the form the statistical analysis will ultimately take. The approach to sample size calculation requires the concepts of the null and alternative hypotheses, significance level, power and, for the majority of situations, the anticipated difference between groups or effect size. We stress the importance of providing a realistic estimate of the latter at the design stage.

9.1

Introduction

Investigators, grant-awarding bodies and biotechnology companies all wish to know how much a proposed trial is likely to cost them. They would also like to be reassured that their money is well spent, by assessing the likelihood that the trial will give unequivocal results. In addition, the regulatory authorities including the Committee for Proprietary Medicinal Products in the European Union, the Food and Drug Administration in the United States of America and many others require information on planned trial size. To this end, many pharmaceutical and related biomedical companies adopt guidelines for Good Clinical Practice (GCP) in the conduct of their clinical trials, and these generally specify that a sample size calculation is necessary.

When designing a new trial, the size (and of course the design) should be chosen so that there is a reasonable expectation that the key question(s) posed will be answered. If too few patients are involved, the trial may be a waste of time because realistic medical improvements are unlikely to be distinguished from chance variation. A small trial with no chance of detecting a clinically meaningful difference between treatments is unfair to all the subjects who are put to the risk and discomfort of the clinical trial. On the other hand, recruiting too many participants is a waste of resources and may be unfair if, for example, a larger than necessary number of patients receive the inferior treatment when one treatment could have been shown to be more effective with fewer patients.

Providing a sample size is not simply a matter of identifying a single number from a set of tables but a process with several stages. At the preliminary stage, what are required are ‘ball-park’ figures that enable the investigator to judge whether or not to start the detailed planning of the trial. If a decision is made to proceed, then a subsequent stage is to refine the calculations for the formal trial protocol itself. For example, when a clinical trial is designed, a realistic assessment of the potential superiority (the anticipated benefit or effect size) of the proposed test therapy must be made before any further planning. The history of clinical trials research suggests that, in certain circumstances, rather ambitious or overly optimistic views of potential benefit have been claimed at the design stage. This has led to the conduct of trials of insufficient size to answer reliably the underlying questions posed.

To estimate the number of subjects required for a trial, we have to first identify a single major outcome that is regarded as the primary endpoint for measuring efficacy. If a trial has more than one primary endpoint, we have to evaluate the sample size for each endpoint in turn (possibly allowing for the anticipated use of a correction factor to allow for the multiplicity of statistical tests at the eventual analysis stage) and perhaps choose the maximum of these estimates as the eventual trial size (see Section 9.6.1).

The regression coefficient βTreat in the model (2.1) corresponds to the true or population difference between two groups. Once we conduct our clinical trial, the corresponding data estimate this quantity with bTreat. However, at the planning stage of the trial, we certainly do not know βTreat, since to determine this is the research question. Neither do we know bTreat, as the trial has not yet been conducted. Nevertheless at the planning stage, in order to calculate sample size, we need to postulate a value for βTreat, which we denote by βTreat,Plan. We hope that βTreat,Plan will be close to βTreat, but we do not know this. At the end of the trial, bTreat may or may not be close to βTreat,Plan but, whether this is the case or not, we use it as an estimate of βTreat. The associated confidence interval provides a measure of our uncertainty with respect to the true value. Although for analysis purposes in Chapter 8 we focused more on confidence intervals than on testing hypotheses and p-values, for the purpose of planning the size of a trial the discussion is more easily conducted in terms of testing hypotheses.

9.1.1 Caution

In what follows it is important to keep the differences between βTreat, βTreat,Plan and bTreat in mind as the notation we have to use in subsequent sections of this book does not always make this easy. Further, using β for the probability of a Type II error when defining the power (see below), as well as when describing the regression coefficients, may also cause confusion, although this usage tends to be standard practice.

9.2 Significance level and power

9.2.1 Significance level

At the planning stage of a trial, we have to define a value for the significance level, α, so that once the trial is completed and analysed, a p-value below this would lead to the rejection of the null hypothesis. Thus, if the p-value ≤ α, one rejects the null hypothesis of a zero effect size, δ = 0, and concludes that there is a statistically significant difference between the interventions – in other words δ ≠ 0. On the other hand, if the p-value > α, then one does not reject the null hypothesis and accepts that δ could indeed be zero. Although the value of α is arbitrary, it is often taken as 0.05 or 5%.

Even when the null hypothesis is in fact true, there is still a risk of rejecting it. To reject the null hypothesis when it is true is to make a Type I error. Plainly the associated probability of rejecting the null hypothesis when it is true is equal to α. The quantity α is interchangeably termed the significance level or probability of a Type I (or false-positive) error and often the test size.

9.2.2 The alternative hypothesis

Usually with statistical significance tests (see Chapter 8), by rejecting the null hypothesis, we do not accept any specific alternative hypothesis. Hence, it is usual, and good practice, to report the range of plausible population values of the true difference with a confidence interval. However, sample size calculations require us to provide a specific alternative hypothesis, HA. This specifies a particular value of the true effect size δ which is not equal to zero.

9.2.3 Power

The ‘power’ of a significance test is the probability that such a test will produce a statistically significant result, given that a difference between groups of a certain magnitude truly exists. The clinical trial could yield an observed difference that would lead to a p-value > α even though the null hypothesis is really not true, that is, δ ≠ 0. In such a situation, we then accept (more correctly phrased as ‘fail to reject’) the null hypothesis although it is truly false. This is called a Type II (false-negative) error and the probability of this is denoted by β.

The probability of a Type II error is based on the assumption that the null hypothesis is not true, that is, δ ≠ 0. There are clearly many possible values for δ in this instance and each would imply a different alternative hypothesis, HA, and a different value for the probability β. Here specifying δ ≠ 0 corresponds to what is termed a 2-sided alternative hypothesis, as δ can be either less than or greater than 0. A 1-sided alternative hypothesis would specify, for example, that δ > 0.

The power is defined as one minus the probability of a Type II error, thus the power equals 1 − β. That is, the power is the probability of obtaining a ‘statistically significant’ p-value when the null hypothesis is truly false and so δ has a value not equal to zero. The choice of the magnitude of the power is arbitrary but is frequently taken as either 90 or 80%; the latter means there is a one-in-five chance of a false negative, that is, failing to detect a true difference of the specified magnitude (see Section 9.5.2).

                                  Difference exists     Difference does not exist
Test statistically significant    (HA true)             (H0 true)
Yes                               Power (1 − β)         Type I error (α)
No                                Type II error (β)

Figure 9.1 Relationship between Type I and Type II errors and significance tests

The relationship between Type I and II errors and significance tests is given in Figure 9.1. It is of crucial importance to consider sample size and power when interpreting statements that follow a completed trial and mention ‘non-significant’ results. In particular, if the power of the trial was very low in the first place, all one can conclude from a non-significant result is that the question of the presence or absence of differences between interventions remains unresolved. A trial with low power is only able to detect large treatment differences, and in many cases, the existence of a large difference may be implausible and the planning alternative hypothesis therefore unrealistic.
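To illustrate why a trial with low power can only detect large differences, the following Python sketch evaluates the approximate power of a 2-sided z-test comparing two means for a small trial. It relies on the Normal approximation developed in Section 9.3 and ignores the negligible chance of a significant result in the wrong direction; the function name, the group size of 15 and the standardised effect sizes are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(n_per_group, delta_std, alpha=0.05):
    """Approximate power of a 2-sided z-test with n subjects per group.

    delta_std is the standardised effect size, delta / sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(delta_std * sqrt(n_per_group / 2) - z_alpha)

# Power of a trial with only 15 subjects per group for a range of effect sizes
for delta_std in (0.2, 0.5, 0.8, 1.2):
    print(delta_std, round(approximate_power(15, delta_std), 2))
```

Under these assumptions the power only becomes respectable for standardised differences well above those usually regarded as realistic, which is the point made in the paragraph above.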

9.3 The fundamental equation

In a trial comparing two groups, with n subjects per group and a continuous outcome variable, $\bar{y}_1$ and $\bar{y}_2$ summarise the respective means of the observations taken. Further, if the data are Normally distributed with equal and known (population) SDs, σ, then the standard errors are $SE(\bar{y}_1) = SE(\bar{y}_2) = \sigma/\sqrt{n}$. Thus, the estimated difference between the groups is $\bar{d} = \bar{y}_2 - \bar{y}_1$ and this has

$$ SE(\bar{d}) = \sigma\sqrt{\frac{1}{n} + \frac{1}{n}} = \sigma\sqrt{\frac{2}{n}}. $$

The two groups are compared using the z-test of Equation (8.4). Here we assume that $SE(\bar{d})$ is the same whether the null hypothesis, H0, of no difference between groups is true, or the alternative hypothesis, HA, that there is a real difference of size δ, is true. Figure 9.2 illustrates the distribution of $\bar{d}$ under the null and alternative hypotheses. The two distributions are such that $\bar{d}$ has a Normal distribution with mean 0 or mean δ, depending on which of the two hypotheses is true.

If the observed $\bar{d}$ from a trial exceeds a critical value, then the result is declared statistically significant. For a significance level α (here assumed one-tailed for expository purposes only), we denote this critical value by dα. Under the assumption that the null hypothesis, H0, is true, $\bar{d}$ has mean 0, and the critical value for statistical significance, dα, is determined by

$$ \frac{d_\alpha - 0}{\sigma\sqrt{2/n}} = z_{1-\alpha}, \quad \text{hence} \quad d_\alpha = z_{1-\alpha}\,\sigma\sqrt{\frac{2}{n}}. \tag{9.1} $$

Here $z_{1-\alpha}$ is the value along the horizontal axis of the standard Normal distribution which, in Figure 8.1, corresponds to a 1-sided area (1 − γ) = α in the upper tail of that distribution centred on 0.

[Figure: probability density curves under H0 and HA, plotted against the standardised variable $\bar{d}/SE(\bar{d})$]

Figure 9.2 Distribution of $\bar{d}/SE(\bar{d})$ under the null and alternative hypotheses

In contrast to the situation when the null hypothesis is assumed true, under the assumption that the alternative hypothesis, HA, is true, $\bar{d}$ now has mean δ ≠ 0 but the same $SE(\bar{d}) = \sigma\sqrt{2/n}$. In this case the probability that $\bar{d}$ exceeds dα is 1 − β and this implies that

$$ \frac{d_\alpha - \delta}{\sigma\sqrt{2/n}} = -z_{1-\beta}, \quad \text{hence} \quad d_\alpha = \delta - z_{1-\beta}\,\sigma\sqrt{\frac{2}{n}}. \tag{9.2} $$

Here $z_{1-\beta}$ is the value along the horizontal axis of the standard Normal distribution of Figure 8.1 and corresponds to the 1-sided area γ = (1 − β) in that distribution. Commonly used values of zγ for different 1- and 2-sided values of α and 1-sided (1 − β) are given in Table T3. Equating the two expressions (9.1) and (9.2), and rearranging, we obtain the sample size for each group in the trial as

$$ n = 2\left(\frac{\sigma}{\delta}\right)^2 \left(z_{1-\alpha} + z_{1-\beta}\right)^2 = \frac{2\left(z_{1-\alpha} + z_{1-\beta}\right)^2}{\Delta^2}, \tag{9.3} $$

where Δ = δ/σ is termed the standardised effect size. Equation (9.3) is termed the Fundamental Equation as it arises, in one form or another, in many situations for which sample sizes are calculated.

The use of Equation (9.3) for the case of a two-tailed test, rather than the one-tailed test, involves a slight approximation since $\bar{d}$ is also statistically significant if it is less than −dα. However, with δ positive the associated probability of observing a result smaller than −dα is negligible. Thus, for the case of a 2-sided test, we simply replace $z_{1-\alpha}$ in Equation (9.3) by $z_{1-\alpha/2}$.

To evaluate Equation (9.3), a planning value for Δ is required and α and β have to be specified. The fundamental equation has to be modified for the specific experimental design proposed for the trial. Thus, if the allocation ratio, 1 : φ, that is the relative numbers of patients to be recruited in each group, differs from 1 : 1, then Equation (9.3) is modified somewhat and further adapted to obtain N, the total number of subjects required. Thus

$$ N = \frac{(1+\varphi)^2}{\varphi}\,\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_{Plan}^2}, \quad \varphi > 0. \tag{9.4} $$

Consequently, when comparing Standard (S) and Test (T) interventions, the numbers of subjects in the groups are nS = N/(1 + φ) and nT = Nφ/(1 + φ), respectively. It must be emphasised that the standardised effect size, ΔPlan, in these equations refers to what the investigators anticipate the true value of Δ will be. Once the trial is completed the data provide an estimate, DData, of the true value, ΔPopulation. This estimate may or may not correspond closely to ΔPlan although the investigators will hope that it does.
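The Fundamental Equation (9.3) and its extension (9.4) can be evaluated as in the following Python sketch. The function names and the planning value ΔPlan = 0.5 are hypothetical, and the quantiles are taken from the standard library rather than from Table T3, so the results may differ slightly from hand calculations with rounded z-values.

```python
from statistics import NormalDist

def z(q):
    """Quantile of the standard Normal distribution."""
    return NormalDist().inv_cdf(q)

def n_per_group(delta_std, alpha=0.05, power=0.8, two_sided=True):
    """Equation (9.3): subjects per group for a 1 : 1 design."""
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return 2 * (z_alpha + z(power)) ** 2 / delta_std ** 2

def total_n(delta_std, phi=1.0, alpha=0.05, power=0.8):
    """Equation (9.4): total N for a 1 : phi allocation (2-sided test)."""
    return (1 + phi) ** 2 / phi * (z(1 - alpha / 2) + z(power)) ** 2 / delta_std ** 2

# Hypothetical planning value for the standardised effect size
delta_plan = 0.5
print(round(n_per_group(delta_plan), 2))      # per group, 1 : 1 allocation
for phi in (1, 2, 3):
    n_total = total_n(delta_plan, phi)
    n_s = n_total / (1 + phi)                 # Standard group
    n_t = n_total * phi / (1 + phi)           # Test group
    print(phi, round(n_total, 2), round(n_s, 2), round(n_t, 2))
```

With φ = 1 the total from Equation (9.4) is exactly twice the per-group value from Equation (9.3), and any departure from equal allocation increases the total number of subjects required for the same power.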

9.4 Specific situations

9.4.1 Comparing means

If the variable being measured is continuous and can be assumed to have a Normal distribution, then when the subjects are randomised to the two options in the ratio 1 : φ, Equation (9.4) is modified to become

$$ N = \frac{(1+\varphi)^2}{\varphi}\,\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_{Plan}^2} + \frac{z_{1-\alpha/2}^2}{2}, \quad \varphi > 0. \tag{9.5} $$

The quantity $z_{1-\alpha/2}^2/2$ adjusts Equation (9.4) for situations when sample sizes are likely to be small.
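As a check on Equation (9.5), the sketch below evaluates it for the planning values used in Example 9.1 (ΔPlan rounded to 0.8, 2-sided α = 0.05, power 80%) with both a 2 : 1 and a 1 : 1 allocation. The function names, and the helper that rounds up to a multiple of 1 + φ, are our own illustrative choices rather than part of any published software.

```python
from math import ceil
from statistics import NormalDist

def total_n_means(delta_std, phi=1.0, alpha=0.05, power=0.8):
    """Equation (9.5): total N when comparing two means, allocation 1 : phi."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    main_term = (1 + phi) ** 2 / phi * (z_alpha + z_beta) ** 2 / delta_std ** 2
    return main_term + z_alpha ** 2 / 2

def round_up_to_multiple(n, multiple):
    """Smallest multiple of `multiple` that is not less than n."""
    return multiple * ceil(n / multiple)

n_two_to_one = total_n_means(0.8, phi=2)
n_one_to_one = total_n_means(0.8, phi=1)
print(round(n_two_to_one, 2), round_up_to_multiple(n_two_to_one, 3))
print(round(n_one_to_one, 2), round_up_to_multiple(n_one_to_one, 2))
```

Rounding up to a multiple of 1 + φ ensures that the total can be divided between the groups exactly in the planned ratio.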

Example 9.1 Difference in means – disease activity

In Example 1.7, Meggitt, Gray and Reynolds (2006) in their trial in patients with moderate-to-severe eczema anticipated a δPlan = 14-unit difference in disease activity between the A and P treatments and a standard deviation of σPlan = 17 units. Together these provide an anticipated standardised effect size ΔPlan = δPlan/σPlan = 14/17 = 0.82 or approximately 0.8. Further, their design stipulated a 2 : 1, A : P allocation ratio, φ = 2, a 2-sided significance level α = 0.05 (5%) and a power, 1 − β = 0.8 (80%). Use of Table T3 gives $z_{1-0.05/2} = z_{0.975} = 1.96$ and $z_{1-\beta} = z_{0.8} = 0.8416$. Substituting these in Equation (9.5) gives

$$ N = \frac{(1+2)^2}{2} \times \frac{(1.96 + 0.8416)^2}{0.8^2} + \frac{1.96^2}{2} = 57.11. $$

However, to be divisible by 1 + φ = 3 this is rounded upwards to 60. From this nS = N/3 = 20 and nT = 2N/3 = 40. Had a 1 : 1 allocation ratio been used then N = 50.98 or 52 to allow equal numbers per group: 8 fewer than the number required with the unequal allocation ratio, φ = 2.

9.4.2 Comparing proportions

If the outcome variable of the 2-group design is binary, such as when a satisfactory response to treatment either is or is not observed, then the total number of subjects required, for anticipated difference δ = πT − πS, is obtained from

$$ N = \frac{1+\varphi}{\varphi}\,\frac{\left[z_{1-\alpha/2}\sqrt{(1+\varphi)\,\bar{\pi}(1-\bar{\pi})} + z_{1-\beta}\sqrt{\varphi\,\pi_S(1-\pi_S) + \pi_T(1-\pi_T)}\right]^2}{(\pi_T - \pi_S)^2}. \tag{9.6} $$

Here πT and πS are the proportions anticipated to respond in the respective groups and $\bar{\pi} = (\pi_S + \varphi\pi_T)/(1+\varphi)$. The numbers to be recruited to each group are nS = N/(1 + φ) and nT = Nφ/(1 + φ), respectively.

Example 9.2 Complete response rate – multiple myeloma

In the randomised trial of Figure 4.1 conducted by Palumbo, Bringhen, Caravita, et al. (2006), Melphalan and Prednisone (MP) was compared with Melphalan, Prednisone and Thalidomide (MPT) in elderly patients with multiple myeloma. A principal endpoint was complete response. At the design stage, the planning values for the response rates were set as 5% with MP and 15% with MPT. Further, their design stipulated a 1 : 1 allocation ratio, a 2-sided significance level α = 0.05 (5%) and a power, 1 − β = 0.9 (90%). Here φ = 1, while πMPT-Plan = 0.15 and πMP-Plan = 0.05 were the anticipated proportions with complete response in the respective groups and hence $\bar{\pi} = (0.05 + 1 \times 0.15)/(1 + 1) = 0.1$. Use of Table T3 gives $z_{1-\alpha/2} = z_{0.975} = 1.96$ and $z_{1-\beta} = z_{0.9} = 1.2816$. Substituting these values in Equation (9.6) gives

$$ N = \frac{1+1}{1} \times \frac{\left[1.96\sqrt{2 \times 0.1 \times (1 - 0.1)} + 1.2816\sqrt{0.05 \times (1 - 0.05) + 0.15 \times (1 - 0.15)}\right]^2}{(0.15 - 0.05)^2} = 374.10 $$

or approximately 380 subjects. This was the planned size used by the investigators. Had the trial been planned with a power of 80%, rather than 90%, the required sample size would have been about 290.
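The calculation of Example 9.2 can be reproduced with a short Python sketch of Equation (9.6). The function name is hypothetical, and because the quantiles come from the standard library rather than the rounded values of Table T3, the final decimal places may differ marginally from a hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def total_n_proportions(pi_s, pi_t, phi=1.0, alpha=0.05, power=0.9):
    """Equation (9.6): total N for comparing two proportions, allocation 1 : phi."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pi_bar = (pi_s + phi * pi_t) / (1 + phi)
    null_term = z_alpha * sqrt((1 + phi) * pi_bar * (1 - pi_bar))
    alt_term = z_beta * sqrt(phi * pi_s * (1 - pi_s) + pi_t * (1 - pi_t))
    return (1 + phi) / phi * (null_term + alt_term) ** 2 / (pi_t - pi_s) ** 2

# Planning values of Example 9.2: 5% complete response with MP, 15% with MPT
n_power_90 = total_n_proportions(0.05, 0.15, phi=1, alpha=0.05, power=0.9)
n_power_80 = total_n_proportions(0.05, 0.15, phi=1, alpha=0.05, power=0.8)
print(round(n_power_90, 1), round(n_power_80, 1))
```

The first value agrees with the 374.10 of the example, while the second, rounded up to a convenient figure, corresponds to the sample size of about 290 quoted for 80% power.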

In certain situations, rather than comparing treatments or interventions, a trial may be concerned with the evaluation of two alternative diagnostic procedures. A single diagnostic procedure is usually assessed relative to a gold standard (GS). For example, in a binary situation, a patient may be assessed for a particular disease of concern and, following (say) examination of a pathological specimen, may be classified as disease absent (GS-No) or present (GS-Yes). This may provide the gold standard. The new diagnostic test (T) is also undertaken and the patient classified as T-No or T-Yes. Over a series of patients, the two diagnostic procedures are compared. Ideally, perfect agreement between the GS and T classifications is achieved although, in practice, this is not always the case. Figure 9.3 gives the usual tabular format for the results of such a study.

                          Gold standard (GS)
Test (T)           Cancer: No       Cancer: Yes
Cancer: No         e                f                  e + f
Cancer: Yes        g                h                  g + h
                   GSNo = e + g     GSYes = f + h      n

Figure 9.3 Comparison of the results from a new diagnostic test against that of the gold standard to indicate or otherwise a diagnosis of cancer

The summary measures obtained from Figure 9.3 are the sensitivity: Se = h/(f + h), the specificity Sp = e/(e + g), the positive predictive value PPV = h/(g + h) and the negative predictive value NPV = e/(e + f). These measures are all binomial proportions but the pertinent sample size, for example for Se, is (f + h) which may be considerably less than n. When conducting a clinical trial of two new diagnostic tests A and B to compare, for example, their sensitivities, these tests will be applied after a 1 : φ randomisation to the respective groups. All target patients will also be tested using the GS. In order to identify an appropriate number of subjects for the target group, NTarget, planning values for SeA and SeB are required as well as the usual considerations with respect to the allocation ratio, test size and power. Once these are set, Equation (9.6) can be used to determine NTarget.
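The four summary measures can be computed directly from the cells of Figure 9.3, as in the brief Python sketch below. The cell counts are hypothetical and serve only to show that the denominator for each measure, for example f + h for the sensitivity, may be much smaller than n.

```python
def diagnostic_summary(e, f, g, h):
    """Summary measures using the cell labels of Figure 9.3.

    e: T-No & GS-No,  f: T-No & GS-Yes,  g: T-Yes & GS-No,  h: T-Yes & GS-Yes.
    """
    return {
        "Se": h / (f + h),    # sensitivity, based on the f + h GS-Yes patients
        "Sp": e / (e + g),    # specificity, based on the e + g GS-No patients
        "PPV": h / (g + h),   # positive predictive value
        "NPV": e / (e + f),   # negative predictive value
    }

# Hypothetical counts: n = 150 patients, of whom only 50 are GS-Yes
summary = diagnostic_summary(e=80, f=5, g=20, h=45)
for name, value in summary.items():
    print(name, round(value, 3))
```

Here the sensitivity rests on only 50 of the 150 patients, which is why the number of GS-positive subjects, rather than n, drives the sample size calculation.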

Example 9.3 High-risk prostate cancer

Hofman, Lawrentschuk, Francis, et al. (2020) compared conventional imaging (ConI) with prostate-specific membrane antigen PET-CT imaging (ProI) in order to detect the presence of metastatic disease (the Target group) in patients with high-risk prostate cancer. The GS assessment comprised a predefined composite panel encompassing histopathologic, imaging, clinical and biochemical findings (Panel). Planning values of SeConI = 0.65 and SeProI = 0.90 were set, and the design values of φ = 1, 2-sided α = 0.1 and 1 − β = 0.85 specified. Hence, from Table T3, $z_{1-0.1/2} = 1.6449$ and $z_{0.85} = 1.0364$. Use of Equation (9.6) then gives the number of patients with metastatic disease required as NMetastatic = 78. The anticipated prevalence of metastatic disease amongst high-risk prostate cancer patients was initially set as πMetastatic = 0.4 and so the number of patients to be included in the study and screened for the presence of any metastatic disease is estimated as NScreen = NMetastatic/πMetastatic = 78/0.4 = 195 or 200. The investigators subsequently revised the anticipated prevalence of metastatic disease to πMetastatic = 0.25, which would imply NScreen = 78/0.25 = 312. Allowing for an approximate 10% patient drop-out, this might then be increased to about 350. However, the investigators used a comparison of the areas under receiver operating characteristic curves as a basis for planning their study and recruited a total of 300 patients. Their results are summarised in Figure 9.4, which estimates SeConI = 0.67 and SeProI = 0.94. The observed proportion of patients with metastatic disease, as determined by the Panel, was pMetastatic = (27 + 36)/(150 + 145) = 63/295 = 0.21.

ConI                            Panel
                      Negative   Positive   Total
Metastatic  Negative        29          9      38
disease     Positive        94         18     112
            Total          123         27     150

SeConI = 18/27 = 0.67

ProI                            Panel
                      Negative   Positive   Total
Metastatic  Negative         6          2       8
disease     Positive       103         34     137
            Total          109         36     145

SeProI = 34/36 = 0.94

Figure 9.4 Sensitivity of two diagnostic tests in detecting any metastatic disease in high-risk prostate cancer patients against that determined from a panel. Source: Data from Hofman, Lawrentschuk, Francis, et al. (2020).
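The arithmetic of Example 9.3 can be checked with a short script. This is only an illustrative sketch: the counts are those read from Figure 9.4, the function name `n_screen` is hypothetical, and dividing by (1 − dropout) is one assumed way of allowing for drop-out (the text simply rounds to "about 350").

```python
from fractions import Fraction
from math import ceil

# Counts read from Figure 9.4 for each test:
# (test-positive among Panel-positive, all Panel-positive, all patients in the arm)
ARMS = {"ConI": (18, 27, 150), "ProI": (34, 36, 145)}

# Sensitivity uses only the Panel-positive patients as its denominator.
sensitivity = {name: tp / gs_pos for name, (tp, gs_pos, _) in ARMS.items()}

# Observed prevalence of metastatic disease across both arms.
prevalence = sum(gs for _, gs, _ in ARMS.values()) / sum(n for _, _, n in ARMS.values())


def n_screen(n_target, pi_target, dropout="0"):
    """Patients to screen so that about n_target GS-positive patients are found.

    Decimal strings keep the arithmetic exact. Inflating by 1 / (1 - dropout)
    is an assumption made for this sketch, not a rule taken from the text.
    """
    needed = Fraction(n_target) / Fraction(pi_target) / (1 - Fraction(dropout))
    return ceil(needed)


print({name: round(se, 2) for name, se in sensitivity.items()}, round(prevalence, 2))
print(n_screen(78, "0.4"), n_screen(78, "0.25"), n_screen(78, "0.25", "0.1"))
```

The last line shows how strongly NScreen depends on the assumed prevalence, which is why recruitment is better tied to NTarget than to NScreen.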

When planning, for example, a comparison of the sensitivities of two diagnostic tests it is, as we indicated in Example 9.3, the number of subjects viewed as positive by the GS evaluation, NTarget, which is crucial. NScreen is determined in order to give the investigators the magnitude of the recruitment task ahead in order to obtain NTarget patients. In practice, recruitment should continue until NTarget patients are obtained and no further. Thus, recruitment stops irrespective of whether the number screened is smaller than, equal to or greater than NScreen.

9.4.3 Ordered categorical data

With only two categories in the scale, we have the binary case described above, although an alternative approach would have been to formulate this in terms of an odds ratio (OR), rather than as a difference in proportions. This formulation leads to very similar sample sizes with the small differences arising due to some approximations that have to be made in that approach. However, although perhaps not as intuitive as a difference, working in terms of an OR scale leads to the extension from the binary split to the situation in which an ordered categorical variable is used for endpoint assessment. The process is quite complex so we illustrate it by means of an example.

Figure 9.5 is a collapsed and restructured version of Figure 8.5 from Example 8.9 in that some of the categories have been merged so that the sample size process is a little easier to describe. In this example, every patient is classified into one of the κ = 4 improvement categories at the end of the trial and the numbers falling into the respective categories in each treatment noted. The individual proportions are then calculated by dividing these by the corresponding numbers in each treatment group.

                           Number        Proportion         Cumulative proportion
Improvement  Category, i   P     A       P        A         P        A
                                         pPi      pAi       qPi      qAi         OR
None         0             6     5       0.3158   0.1389    0.3158   0.1389
Slight       1             4     6       0.2105   0.1667    0.5263   0.3056     2.86
Moderate     2             8     9       0.4211   0.2500    0.9474   0.5556     2.52
Marked       3             1    16       0.0526   0.4444    1.0000   1.0000    14.41
Total                     19    36       1.0000   1.0000

Figure 9.5 Investigator assessed response to treatment in patients with moderate-to-severe eczema. Source: Adapted from Meggitt, Gray and Reynolds (2006).

The corresponding odds ratios compare the odds of a subject being in a given category or higher in one group with the odds for the same categories in the other group. For category i, which takes values 1, 2 and 3, the odds ratio is given by

$$\mathrm{OR}_i = \frac{q_{Pi}\,(1 - q_{Ai})}{q_{Ai}\,(1 - q_{Pi})}. \qquad (9.7)$$

Using this equation in the data of Figure 9.5 gives

$$\mathrm{OR}_1 = \frac{q_{P1}\,(1 - q_{A1})}{q_{A1}\,(1 - q_{P1})} = \frac{0.3158\,(1 - 0.1389)}{0.1389\,(1 - 0.3158)} = 2.86,$$

OR2 = 2.52 and OR3 = 14.41. Although by no means the case in this illustration, for sample size purposes, an assumption is made which specifies that the underlying population odds ratios will be the same irrespective of the category division chosen. Thus, for all categories defined this is then set to ORPlan. In practice, the individual ORs obtained from such as our example may be averaged to provide an initial overall ORPlan – perhaps by using either their median value of 2.86 or the geometric mean of 4.70.

In designing a trial, the first requirement is to specify the proportion of subjects anticipated, in each category of the scale, for one of the groups. For κ = 4 categories these anticipated or planning proportions for one group are set as π P1, π P2, π P3 and π P4 respectively where π P1 + π P2 + π P3 + π P4 = 1. Further, we define planning QP1 = π P1, QP2 = π P1 + π P2, QP3 = π P1 + π P2 + π P3 and QP4 = π P1 + π P2 + π P3 + π P4 = 1. From these, and using the specified ORPlan, the planning values for the comparator group are then calculated as

$$Q_{Ai} = \frac{Q_{Pi}}{(1 - Q_{Pi})\,\mathrm{OR}_{Plan} + Q_{Pi}}.$$

Once these are obtained, the planning values π A1, π A2, π A3 and π A4, respectively, can be determined. Finally, $\bar{\pi}_i = (\pi_{Ai} + \pi_{Pi})/2$, the average proportion of subjects anticipated in category i, is required. In the case of κ categories, the required total sample size for a 1 : 1 randomisation is

$$N_\kappa = \frac{4\,\Gamma_\kappa \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\left(\log \mathrm{OR}_{Plan}\right)^2}, \qquad (9.8)$$

where

$$\Gamma_\kappa = \frac{3}{1 - \sum_{i=1}^{\kappa} \bar{\pi}_i^{\,3}}, \qquad (9.9)$$

and for κ = 2, $\Gamma_2 = 1/[\bar{\pi}(1 - \bar{\pi})]$.

When the number of categories is large, it is clearly difficult to postulate the proportion of subjects who would fall into a given category but if the number of categories exceeds five, then Γ5+ in Equation (9.9) is approximately equal to 3. This then simplifies the calculations a great deal as they now only depend on ORPlan, the significance level, α, and power, 1 − β. The assumption of constant OR implies that it is justified to use the anticipated OR from any pair of adjacent categories for planning purposes.
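As a check on Equation (9.7), the cumulative odds ratios can be recomputed directly from the patient counts of Figure 9.5. The sketch below is illustrative only and the variable names are hypothetical. Because it works from the exact counts rather than from proportions rounded to four decimal places, the second and third odds ratios come out as 2.53 and 14.40, differing in the last digit from the tabulated 2.52 and 14.41; the median and geometric mean are unchanged at two decimal places.

```python
from itertools import accumulate
from statistics import geometric_mean, median

# Patient counts from Figure 9.5, ordered None, Slight, Moderate, Marked.
placebo = [6, 4, 8, 1]
azathioprine = [5, 6, 9, 16]


def cumulative_proportions(counts):
    """Cumulative proportion of patients at or below each category."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


q_p = cumulative_proportions(placebo)
q_a = cumulative_proportions(azathioprine)

# Equation (9.7) at each of the three cut-points; the final cumulative
# proportion is 1 in both groups, so it defines no odds ratio.
odds_ratios = [
    qp * (1 - qa) / (qa * (1 - qp))
    for qp, qa in zip(q_p[:-1], q_a[:-1])
]

print([round(o, 2) for o in odds_ratios])
print(round(median(odds_ratios), 2), round(geometric_mean(odds_ratios), 2))
```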

Example 9.4

Moderate-to-severe eczema

Suppose a confirmatory trial of that conducted by Meggitt, Gray and Reynolds (2006) is planned on the basis of the information provided in Figure 9.5. There the corresponding odds ratios are all greater than what would be regarded as a clinically useful improvement. As a consequence, the investigators set a lower value of ORPlan = 2, but used the observed proportions in each improvement category with P as the basis for their planning.

                            Placebo (P)         Azathioprine (A)
Improvement   Category i    πPi      QPi        QAi       πAi        π̄i
None          0             0.32     0.32       0.1905    0.1905     0.2553
Slight        1             0.21     0.53       0.3605    0.1700     0.1900
Moderate      2             0.42     0.95       0.9048    0.5443     0.4822
Marked        3             0.05     1.00       1.0000    0.0952     0.0726
Total                       1.00                          1.0000

Figure 9.6 Planned confirmatory trial of that conducted by Meggitt, Gray and Reynolds (2006). Source: Modified from Meggitt, Gray and Reynolds (2006).

Using Equation (9.9), with the values from Figure 9.6, Γ4 = 3/[1 − (0.2553³ + 0.1900³ + 0.4822³ + 0.0726³)] = 3.4722, then with 2-sided α = 0.05, z0.975 = 1.96, power 1 − β = 0.8, z0.8 = 0.8416, and so from Equation (9.8)

$$N_4 = \frac{4 \times 3.4722 \times (1.96 + 0.8416)^2}{(\log 2)^2} = 226.90 \approx 230.$$

Hence, on a 1 : 1 allocation, nP = nA = 115 patients per intervention.
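The steps of Example 9.4 can be sketched in a few lines of code. This is a minimal illustration of Equations (9.8) and (9.9) under the constant odds ratio assumption, not the specialist software referred to in the text; the function name and its defaults are hypothetical. Intermediate values are carried at full precision, so the results agree with Figure 9.6 and with N4 = 226.90 only to within rounding.

```python
from math import log


def ordinal_sample_size(pi_p, or_plan, z_alpha=1.96, z_beta=0.8416):
    """Total sample size for a 1 : 1 trial with an ordered categorical endpoint.

    pi_p    -- planning proportions in each category for group P (sum to 1)
    or_plan -- planning odds ratio, assumed constant across cut-points
    """
    # Cumulative planning proportions Q_Pi for group P.
    q_p, running = [], 0.0
    for proportion in pi_p:
        running += proportion
        q_p.append(running)

    # Comparator cumulative proportions Q_Ai implied by the planning OR.
    q_a = [q / ((1 - q) * or_plan + q) for q in q_p]

    # Difference the cumulative values to recover the category proportions.
    pi_a = [q_a[0]] + [later - earlier for earlier, later in zip(q_a, q_a[1:])]

    # Average proportion per category, then Equations (9.9) and (9.8).
    pi_bar = [(a + p) / 2 for a, p in zip(pi_a, pi_p)]
    gamma = 3 / (1 - sum(p ** 3 for p in pi_bar))
    n_total = 4 * gamma * (z_alpha + z_beta) ** 2 / log(or_plan) ** 2
    return pi_a, gamma, n_total


pi_a, gamma, n_total = ordinal_sample_size([0.32, 0.21, 0.42, 0.05], or_plan=2)
print([round(p, 4) for p in pi_a], round(gamma, 4), round(n_total, 1))
```

Re-running the function with a different `or_plan` shows how quickly the required numbers grow as the planning odds ratio approaches 1.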

As has become clear, sample size calculation in this situation is not a straightforward process but the book by Machin, Campbell, Tan and Tan (2018) has specialist software included to enable this to be implemented.

9.4.4 Time-to-event

As indicated in Chapter 8, comparisons between groups when summarising time-to-event data can be made using the logrank test and the summary statistic used is the hazard ratio (HR). The test of the null hypothesis of equality of event rates between the groups with respect to the event (endpoint) concerned provides the basis for the sample size calculations. This is expressed as H0:HR = 1. Pre-trial information on the endpoint, either as the anticipated median ‘survival’ for each group or as the anticipated proportions ‘alive’ at some fixed time point, will usually form the basis of the anticipated difference between groups for planning purposes. The corresponding effect size is HRPlan. If proportions alive at a chosen time point are anticipated to be π S and π T, then

$$\mathrm{HR}_{Plan} = \frac{\log \pi_T}{\log \pi_S}. \qquad (9.10)$$

On the other hand, if a planning value of the median survival time MS of, for example, the S group is given then this implies that, at that median time, half are alive and half not, so that π S = 0.5. Further, if MT is given, then HRPlan = MS/MT, and use of Equation (9.10) allows a planning value of

$$\pi_T = \exp\left(\log 0.5 \times \mathrm{HR}_{Plan}\right) = \exp\left(-0.6931 \times \mathrm{HR}_{Plan}\right) \qquad (9.11)$$

to be obtained. Once HRPlan is obtained then the total number of events required to be observed is

$$E = \frac{1}{\varphi}\left(\frac{1 + \varphi\,\mathrm{HR}_{Plan}}{1 - \mathrm{HR}_{Plan}}\right)^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}. \qquad (9.12)$$

The corresponding total number of subjects needed in order to observe these events is

$$N = \frac{1 + \varphi}{\varphi}\left(\frac{1 + \varphi\,\mathrm{HR}_{Plan}}{1 - \mathrm{HR}_{Plan}}\right)^{2}\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{(1 - \pi_S) + \varphi\,(1 - \pi_T)}. \qquad (9.13)$$

Hence, nS = N/(1 + φ) and nT = Nφ/(1 + φ) subjects, respectively, are required in the two intervention groups.

Example 9.5

Gastric cancer

Cuschieri, Weeden, Fielding, et al. (1999) compared two forms of surgical resection for patients with gastric cancer. The primary outcome (event of interest) was time to death. The authors state:

‘Sample size calculations were based on a pre-trial survey of 26 gastric surgeons, which indicated that the baseline 5-year survival rate of D1 surgery was expected to be 20%, and an improvement in survival to 34% (14% change) with D2 resection would be a realistic expectation. Thus 400 patients (200 in each arm) were to be randomised, providing 90% power to detect such a difference with P < 0.05.’

Here π D1 = 0.2, π D2 = 0.34 and so from Equation (9.10), HRPlan = log 0.34/log 0.2 = (−1.0788)/(−1.6094) = 0.6703. The authors set 1 − β = 0.9, imply a 2-sided significance level α = 0.05, and a randomisation in equal numbers to each group, hence φ = 1. First making use of Table T3 gives z1−0.025 = 1.9600 and z0.9 = 1.2816, then substituting all the corresponding values in Equation (9.13) gives

$$N = 2\left(\frac{1 + 0.6703}{1 - 0.6703}\right)^{2}\frac{(1.9600 + 1.2816)^{2}}{(1 - 0.2) + (1 - 0.34)} = 369.44.$$

This suggests that at least N = 370 patients with gastric cancer are required.
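The calculation of Example 9.5 can be reproduced with a short sketch of Equations (9.10), (9.12) and (9.13). The function names are hypothetical and the default z-values are those quoted from Table T3 for a 2-sided α = 0.05 and 90% power.

```python
from math import ceil, log


def hazard_ratio(pi_s, pi_t):
    """Equation (9.10): planning HR from the proportions alive at a fixed time."""
    return log(pi_t) / log(pi_s)


def events_required(hr, phi=1.0, z_alpha=1.96, z_beta=1.2816):
    """Equation (9.12): total number of events for a 1 : phi allocation."""
    return (1 / phi) * ((1 + phi * hr) / (1 - hr)) ** 2 * (z_alpha + z_beta) ** 2


def subjects_required(pi_s, pi_t, phi=1.0, z_alpha=1.96, z_beta=1.2816):
    """Equation (9.13): total subjects needed to observe the required events."""
    hr = hazard_ratio(pi_s, pi_t)
    events = events_required(hr, phi, z_alpha, z_beta)
    return (1 + phi) * events / ((1 - pi_s) + phi * (1 - pi_t))


hr_plan = hazard_ratio(0.20, 0.34)
n_events = events_required(hr_plan)
n_total = subjects_required(0.20, 0.34)
print(round(hr_plan, 4), ceil(n_events), ceil(n_total))
```

Writing N as a multiple of E makes the logic explicit: the power is driven by the number of events, and N is simply the number of subjects expected to yield that many events by the chosen time point.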

9.5 Practical considerations

9.5.1 Trial objectives

It is customary to start the process of estimating sample size by specifying the size of the difference required to be detected and then to estimate the number of participants required to enable the trial to detect this difference if it really exists. Given that this is a plausible and a scientifically or medically important change then, at the planning stage, the investigators should be reasonably certain to detect such a difference after completing the trial. ‘Detecting a difference’ is usually taken to mean ‘obtain a statistically significant difference with p-value ≤ 0.05’. Similarly, the phrase ‘to be reasonably certain’ is usually interpreted to mean something like ‘have a chance of at least 90% of obtaining such a p-value’ if there really is a difference of the magnitude anticipated. The major components of this process are summarised in Figure 9.7 to which, in the case of an endpoint which is continuous, ‘SD size, σ’ needs to be added.

Design option
Effect size, δ         The anticipated (planning) size of the difference between the two groups
Type I error, α        Equivalently the significance level of the statistical test to be used in the analysis
Type II error, β       Correspondingly the power, 1 − β
Withdrawal rate, W     Anticipated withdrawal or lost to follow up rate
Allocation ratio, φ    The relative numbers of subjects to be included in each of the two intervention groups

Figure 9.7 Components necessary to estimate the size of a comparative trial

9.5.2 The anticipated effect size

A key element in the design is the ‘effect size’ that is reasonable to plan to observe, should it exist. The way in which possible effect sizes are determined will depend on the specific situation under consideration. Sometimes there may be very detailed prior knowledge which then enables an investigator to anticipate what effect size between groups is likely to be observed, and the role of the trial is to confirm that expectation. In general, estimates of the anticipated effect size may be obtained from the available literature, formal meta-analyses of related trials or elicited from expert opinion.

In practice, a range of plausible effect size options is usually considered before the final planning effect size is agreed. For example, an investigator might specify a scientifically or clinically useful difference that it is hoped could be detected and would then estimate the required sample size on this basis. These calculations might then indicate that an extremely large number of subjects is required. As a consequence, the investigator may next define a revised aim of detecting a rather larger difference than that originally specified. The calculations are repeated, and perhaps the sample size now becomes realistic in that new context.

One problem associated with planning comparative clinical trials is that investigators are often overly optimistic about the magnitude of the improvement of a new treatment over the standard. This optimism is understandable, since it can take considerable effort to initiate a trial and, in many cases, the trial would only be launched if the investigator is enthusiastic enough about the new treatment and is sufficiently convinced about its potential efficacy. However, experience suggests that as trial succeeds trial there is often a growing realism that, even at best, earlier expectations were optimistic. There is ample historical evidence to suggest that trials that set out to detect large treatment differences nearly always result in ‘no significant difference was detected’.
In such cases, there may have been a true and worthwhile treatment benefit that has been missed, since the level of detectable differences set by the design was unrealistically high, and hence the sample size too small to establish the true (but less optimistic) size of benefit.

In the case of a continuous measure, where the difference between two groups is expressed by the difference between their means, δ = (μT − μS), and σ is the SD of the endpoint variable, there may be little prior information on the two components. In that case one may consider, for planning purposes, values of their ratio

$$\Delta_{Cohen} = \frac{\delta}{\sigma}, \qquad (9.14)$$

which is termed the standardised effect size. Based on experience from the social sciences, Cohen (1988) then, as summarised in Figure 9.8, made recommendations for meaningful interpretation of values for this ratio for sample size determination purposes. Thus, in planning a trial an investigating team may propose to calculate sample size on the basis of, for example, Δ = 1.5. Cohen’s suggestion would classify this as very large and, following from that, since very large differences between interventions are seldom seen in clinical trials this is likely to be an unrealistic expectation. At the other extreme, were they to propose Δ = 0.1 then Cohen’s suggestion would classify this as very small and (possibly) not worthy of investigation and in any case would automatically require a very large trial. Values close to Δ = 0.5 may be considered a reasonable expectation and associated with a more reasonable trial size.

ΔCohen     Standardised effect
≤ 0.2      Small
≈ 0.5      Moderate
≥ 0.8      Large

Figure 9.8 Interpretation of a planning standardised effect size. Source: Based on Cohen (1988).

Experience has suggested that in many areas of clinical research the Cohen (1988) suggestions can be taken as a good pragmatic guide for planning purposes, so even if the components’ values necessary to calculate Δ are available, it is worth checking the associated ΔPlan against the criteria of Figure 9.8.

Example 9.6

Anticipated effect size

As we noted in Example 9.1, Meggitt, Gray and Reynolds (2006) stipulated δPlan = 14 units with standard deviation σ Plan = 17 units as their design criteria for assessing the difference in disease activity between A and P. The corresponding anticipated standardised effect size is ΔPlan = δPlan/σ Plan = 14/17 = 0.82. This would be regarded as a large effect using the Cohen criteria and so this would suggest (at the design stage) the possibility of a more modest outcome should be reviewed before finally deciding on patient numbers.

When comparing two proportions, the standardised effect size becomes in this case

$$\Delta_{Cohen} = \frac{\pi_{A,Plan} - \pi_{P,Plan}}{\sqrt{\bar{\pi}_{Plan}\,(1 - \bar{\pi}_{Plan})}},$$

where $\bar{\pi}_{Plan} = (\pi_{P,Plan} + \pi_{A,Plan})/2$. Thus, in the design of the trial of Example 9.2 in elderly patients with multiple myeloma conducted by Palumbo, Bringhen, Caravita, et al. (2006), they set π MPT-Plan = 0.15, π MP-Plan = 0.05 and so $\bar{\pi}_{Plan}$ = (0.15 + 0.05)/2 = 0.1 giving

$$\Delta_{Plan} = \frac{0.15 - 0.05}{\sqrt{0.1\,(1 - 0.1)}} = 0.33.$$

This represents a small to modest effect size using the Cohen criteria and thus seems to provide a realistic scenario for planning purposes.
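Both standardised effect sizes of Example 9.6 are easily checked in code. In the sketch below the function names are hypothetical, and the wording returned for values lying between the three anchor points of Figure 9.8 is an assumption of this illustration, since the figure itself gives only the anchors.

```python
from math import sqrt


def cohen_continuous(delta, sigma):
    """Equation (9.14): difference in means divided by the SD."""
    return delta / sigma


def cohen_binary(pi_a, pi_p):
    """Standardised effect size for two proportions, using their average."""
    pi_bar = (pi_a + pi_p) / 2
    return (pi_a - pi_p) / sqrt(pi_bar * (1 - pi_bar))


def cohen_label(effect):
    """Rough reading of Figure 9.8; the in-between wording is assumed here."""
    size = abs(effect)
    if size <= 0.2:
        return "small"
    if size >= 0.8:
        return "large"
    return "small to moderate" if size < 0.5 else "moderate to large"


eczema = cohen_continuous(14, 17)       # Meggitt, Gray and Reynolds (2006)
myeloma = cohen_binary(0.15, 0.05)      # Palumbo, Bringhen, Caravita, et al. (2006)
print(round(eczema, 2), cohen_label(eczema))
print(round(myeloma, 2), cohen_label(myeloma))
```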

9.5.3 Significance level and power

The choices of the significance level and power for use in the sample size calculations are essentially arbitrary. However, accepted practice has built up over the years so the conventional value for α is 0.05 (5%), or less often 0.01 (1%). In contrast, although a power of 1 − β = 0.8 (80%) is recognised as a minimum requirement, many investigators fail to realise that 80% power means that they have a one-in-five (20%) risk of a false negative, which, assuming the new intervention really is superior, means there is a serious risk of a wasted effort in conducting their trial. There has been a move to increase this to 0.9 (90%) although the greater the power the greater the required sample size. The main reason for this is to ensure that clinical trials are able to provide convincing evidence of relative efficacy of the interventions being compared. Meggitt, Gray and Reynolds (2006) used a power of 80% for their trial while Palumbo, Bringhen, Caravita, et al. (2006) used a power of 90%. Both sets of investigators chose a 2-sided significance level of 5%.

9.5.4 Allocation ratio

From a statistical perspective, an allocation ratio of 1 : 1 is usually the most efficient, in that it produces a minimum sample size for a given effect size, α and β. However, there may be situations where other ratios may be indicated. For example, Meggitt, Gray and Reynolds (2006) used a 2 : 1 randomisation in favour of A over P in patients with moderate-to-severe eczema. They used this ratio to encourage recruitment … and to increase the likelihood of identifying infrequent adverse events.

Thus, part of their rationale was concerned with obtaining information on a secondary endpoint; in this case the occurrence of possible adverse events. Another situation in which unequal allocation may be useful is, for example, when there is a restricted supply of the new or test intervention whereas the standard is more readily available. For example, Erbel, Di Mario, Bartunek, et al. (2007) in Example 1.12 were investigating a new bioabsorbable stent for coronary scaffolding and one might foresee the supplies of the ‘experimental’ stent being limited while those used in current practice are readily available. In such cases, the number of patients for which the test stent can be given is fixed (perhaps at a relatively small number), but recruiting more than this number to receive the control stent could increase the statistical efficiency of the design. However, this is, in a sense, a case of ‘limited resources’ as we discuss in the following section.

9.5.5 Limited resources

A common situation is one where the number of subjects, often patients, that can be included in a trial is governed by non-scientific forces such as time, money or human resources. Thus, with a predetermined (maximal) sample size, the researcher may then wish to know what probability he or she has of detecting a certain effect size with a trial confined to this size. If the resulting power is small, say δPlan – should they stop recruiting and claim benefit to T? Should they be wrong and T appears worse than S – should they stop recruiting to protect the patients? Should they be wrong and T is about as effective as S – should they stop recruiting as there is little point in continuing? These concerns suggest that interim looks at the data may be desirable. With such circumstances in mind, a DSMB may be established, and the trial design team have to decide on the frequency of such interim analyses.
They will be aware that multiple looks at (ever accumulating) data from the same endpoint variable raise the same issues as those concerned with trials with multiple endpoints (as discussed in Chapter 8). Thus, in this situation too, the lack of independence of the analyses causes the p-values to become distorted: the greater the number of interim analyses the more the distortion. Essentially, this implies that the more one looks at the data (and hence more frequently tests the null hypothesis), the more likely one is to reject the null hypothesis at the 5% level even if it is really true. This raises the distinct possibility of inappropriately claiming efficacy and stopping a trial early as a consequence. This highlights the difficulties posed by any interim analysis: such analyses are necessarily based on relatively few patients, are very sensitive to chance fluctuations and, if declared statistically significant, provide an estimate of the treatment effect much larger than that anticipated by the design. Thus, such analyses have to be viewed with extreme caution by the investigating team before taking a view of the implications.

10.4.2 Efficacy

In this scenario, the DSMB might have to consider whether to stop a trial if the data accumulated so far appear to suggest a larger, and therefore clinically extremely important, difference between treatments in favour of T over S than that anticipated by the design team. In which case, superiority of T over S would be claimed. Clearly if such an extreme benefit was truly present, then it would seem important that the trial should stop immediately and the ‘proven’ treatment be immediately recommended for clinical use.

In judging such a situation, the DSMB will be aware that the uncertainty surrounding the potential benefit was high amongst the TSC at the planning stage of the trial and that this uncertainty justified the trial question being posed in the first place. Further, this uncertainty provided the clinical equipoise to enable investigators to seek informed consent and randomise their patients. The DSMB will also be aware that the purpose of conducting and completing the trial as planned is to reduce this uncertainty (one can never eliminate it all) to such levels that the TSC would then be able to firmly recommend (say) the better treatment for clinical use. Thus, the DSMB is faced with a dilemma of whether to recommend stopping the trial early on the evidence from relatively few patients or to continue the trial and thereby potentially deprive some patients of the better option.

If the trial concerns a life-threatening disease, rather than (say) a condition which will eventually resolve whatever the treatment given, then different decisions with respect to ‘stopping’ may be anticipated. In the latter situation perhaps a DSMB need not be established at all and provided safety is not an issue, the only reason to stop early would be failure to recruit. Decisions with respect to recommending whether a trial should stop might also differ if T appears to be doing (much) worse than S. In which case, inferiority of T when compared to S is inferred. We should emphasise, however, that the role of a randomised controlled trial is to influence clinical practice and that this entails continuing a trial until there is sufficient evidence to convince even the most sceptical clinicians of the relative advantage of one intervention over the other.
Experience suggests that trials are sometimes stopped too early and therefore may fail to convince reviewers, editors or readers of their findings. In which case, their conduct is a waste of resources and also such early closure may prevent future patients from receiving an intervention which might have brought benefit. Thus, the DSMB must consider these potential consequences of early stopping in their deliberations.

Example 10.6

Heroin dependence

Schottenfeld, Chawarski and Mazlan (2008) investigated in a double-blind controlled trial the value of maintenance treatment with placebo, buprenorphine or naltrexone in detoxified heroin dependent patients. The trial planned to recruit 180 patients randomised equally to the three interventions, whilst the report was based on 126. The authors state:

‘The study was terminated after 22 months of enrolment because buprenorphine was shown to have greater efficacy in an interim safety analysis.’

10.4.2.1 Stopping for efficacy

In order to assist the DSMB to make a recommendation with respect to trial continuation, the protocol may have set out details of the frequency of interim analyses that are to be presented to the committee. As we have outlined, if several interim analyses of the accumulating trial data are to be conducted then these may compromise the nominal overall significance level, α, which has been set when determining the sample size for the trial concerned.

In the usual test of significance for a 2-sided α = 0.05, statistical significance of the z-test of Equation (8.3) is indicated if the value calculated, zData, is below −1.96 or above +1.96. The test uses the fact that approximately 95% of the area under the standard normal distribution lies within two standard deviations (SDs) of zero. Whether the two interventions are regarded as statistically significantly different depends on the actual value of zData from which the p-value is obtained from Table T2. This situation corresponds to approximately 2 × SE either side of the estimated difference, d, between the two intervention groups. If this range is extended to 3.00 SDs, then from Table T2, the 2-sided tail area is 2(1 − 0.99865) = 0.0027 and the area within the range increases to 99.73%. Thus, a p-value less than 0.0027 would only occur if zData lies beyond ±3, that is, if the estimated difference, d, is more than 3 × SE away from zero.

This fact led Haybittle (1971) to suggest that any interim analysis of accumulating data should only be declared significant if the p-value is less than αHaybittle = 0.0027. In this situation, at the close of the trial, following the recruitment of all the initially planned subjects, the final test of significance is then made with the associated 2-sided α = 0.05 as set when determining the trial size.
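The tail areas quoted above can be obtained without tables. The sketch below uses the standard normal distribution from Python's `statistics` module in place of Table T2; the function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def two_sided_p(z):
    """2-sided tail area beyond +/- z under the standard normal distribution."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def critical_z(alpha):
    """Value of z that leaves a 2-sided tail area of alpha."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


print(round(two_sided_p(1.96), 4))   # conventional 5% test
print(round(two_sided_p(3.00), 4))   # Haybittle (1971) interim threshold
print(round(critical_z(0.05), 2), round(critical_z(0.0027), 2))
```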

Example 10.7

Children undergoing general anaesthesia

In the trial, subsequently conducted by Drake-Brockman, Ramgolam, Zhang, et al. (2017) in children older than 1 year undergoing general anaesthesia, a stopping rule was implemented and the trial terminated before full recruitment was achieved. The trial endpoint was the occurrence of perioperative respiratory adverse events (PRAE). It was designed to compare the use of laryngeal mask airways (LMA) with endotracheal tubes (ET). For ET the design specified π ET = 0.35 which was anticipated to be reduced using LMA to π LMA = 0.20. Hence, δPlan = 0.15. Specifying a 2-sided test size of α = 0.05 (5%), power 80%, 1 : 1 randomisation and a 5% dropout suggested 290 participants were required. The authors state: The interim analysis was done when 50% of recruitment was completed by use of the Haybittle–Peto boundary rule for group sequential testing (α1 = 0.0027, α2 = 0.049 96) with a stopping condition imposed on the trial if a difference of 25% in the primary outcome was detected.

[The paper sets α2 = 0.049 96 but this has unreasonable precision and is more appropriately described as α2 = 0.05.]

In the event, the interim analysis involved 177 patients with pET = 50/94 = 0.5319 and pLMA = 15/83 = 0.1807 indicating a reduction of 0.3512 (35%) in PRAE. This is much greater than the threshold of 25% specified by the investigators. A statistical test comparing the two proportions leads to z = 4.8369 and p-value = 0.0000013 which is much smaller than α1 = 0.0027 and thereby suggests the trial could be closed and superiority concluded for LMA. In this example, the observed proportion for the test approach, pLMA = 0.18, was very close to the planning value of π LMA = 0.20 whereas, for the traditional choice, pET = 0.53 is much greater than π ET = 0.35. So, the trial stopped early as the standard ET did much worse than had been anticipated.
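The interim test of Example 10.7 can be reproduced as follows. Equation (8.3) is not restated here, so the sketch assumes the usual z-test for two proportions with a pooled standard error; that assumption does reproduce the quoted z = 4.8369. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(events_1, n_1, events_2, n_2):
    """z-test comparing two proportions, using a pooled SE (assumed form)."""
    p_1, p_2 = events_1 / n_1, events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_1 - p_2, z, p_value


# PRAE events: 50 of 94 with ET and 15 of 83 with LMA at the interim analysis.
difference, z_data, p_value = two_proportion_z(50, 94, 15, 83)
stop_for_efficacy = p_value < 0.0027   # Haybittle-Peto interim threshold

print(round(difference, 4), round(z_data, 4), stop_for_efficacy)
```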

The Haybittle (1971) stopping rule suggestion applies to each test of significance conducted at every interim analysis irrespective of their number, and then, the final analysis for the trial report is conducted at the design value of α = 0.05. The rule is very easy to apply and there is no necessity to prespecify the number of interim analyses planned. They can therefore be conducted whenever they are required. Nevertheless, good practice should dictate that the trial protocol should set out clearly when and how many interim analyses are to be implemented. Peto, Pike, Armitage, et al. (1976) recommended a similar approach to Haybittle (1971) but with a more extreme p-value set at 0.001 making it even less likely that a trial will stop for superiority at an early stage. Hence, the general approach is termed Haybittle–Peto. However, alternative stopping rules can follow a more rigid interim analysis schedule. Thus, O’Brien and Fleming (1979) suggested the possibilities shown in Figure 10.3. Specifically, if the final, and one interim analysis, is planned, the test size is set at α1 = 0.0051 which is double the Haybittle (1971) suggestion. The final analysis has αFinal = 0.0415 rather than the planning values of 0.05. If two interim analyses are planned α1 = 0.001, α2 = 0.0151 and αFinal = 0.047. In general, as the number of interim analyses increases the corresponding α1 reduces and the trials would be less likely to close than were the Haybittle (1971) rule to be applied. Pocock (1983) suggested the possibilities also shown in Figure 10.3. Specifically, the p-value depends only on the number of interim analyses planned and remains at that level for every test of significance conducted. For example, if two interim and a final analysis are planned, the test size is set at αAll = 0.022 which, in this case, is about ten times the Haybittle (1971) suggestion. 
As we have indicated, the Haybittle–Peto rule is straightforward to apply as it does not require the number of interim analyses to be prespecified and uses an αFinal equal to that used for determining sample size. However, the more complex alternatives indicate ‘Stop’ at the interim stages in rather less extreme situations.
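The effect of repeated testing on the overall significance level can be illustrated by simulation. The sketch below is a simplified illustration rather than any of the published methods: it assumes equally sized stages, a test statistic that is standard normal under the null hypothesis, and one interim plus one final analysis. It compares testing naively at 0.05 on both occasions with the Pocock and Haybittle values quoted in the text. All names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def overall_type_one_error(nominal_alphas, n_trials=20000, seed=1):
    """Estimate the chance of at least one 'significant' look when H0 is true."""
    rng = random.Random(seed)
    bounds = [NormalDist().inv_cdf(1 - alpha / 2) for alpha in nominal_alphas]
    rejections = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look, bound in enumerate(bounds, start=1):
            cumulative += rng.gauss(0.0, 1.0)       # data from one more stage
            if abs(cumulative / sqrt(look)) > bound:
                rejections += 1                      # trial stops at this look
                break
    return rejections / n_trials


naive = overall_type_one_error([0.05, 0.05])
pocock = overall_type_one_error([0.029, 0.029])
haybittle = overall_type_one_error([0.0027, 0.05])
print(naive, pocock, haybittle)
```

Under these assumptions the naive rule typically gives an overall false positive rate well above 0.05 (in the region of 0.08), whereas the Pocock and Haybittle values keep it close to the nominal 0.05.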

Number of analyses     Haybittle (1971)    O’Brien-Fleming (1979)    Pocock (1983)
Interim   1            0.0027              0.0051                    0.029
Final     2            0.05                0.0415                    0.029

Interim   1            0.0027              0.0006 (0.001*)           0.022
Interim   2            0.0027              0.0151                    0.022
Final     3            0.05                0.047                     0.022

Interim   1            0.0027

[Flow diagram not reproduced. Surviving entries: Needed > 6 bursts of Prednisone = 2 (each group); 250 completed study; 244 completed study.]

Figure 11.4 Trial profile following the CONSORT guidelines. Source: Based on Szefler, Mitchell, Sorkness, et al. (2008).

the necessary information to make a judgement on whether or not this attrition of patients will have any impact on the clinical interpretation of the findings presented.

11.6.2 Participant characteristics

Intervention and control groups comparable in relevant measures?

Although the eligibility criteria are specified in the protocol, it is clearly important to describe in the trial report the types of patients actually included and randomised. The (baseline) characteristics summarised usually include some basic demographic data, information on the condition under investigation and variables that are known or suspected to be prognostic for outcome. This has been done with patients with eczema in Figure 11.5 which summarises baseline demographic and clinical characteristics of the

11.6 FINDINGS

251

                                                        Treatment
                                                        Placebo        Azathioprine
------------------------------------------------------------------------------------
Number of patients                                      20             41
Demographic
  Age (years)                                           36 (12)        30 (11)
  Men (%)                                               16 (80%)       19 (46%)
Potentially prognostic
  TPMT activity (nmol/h/mL RBC)                         10.4 (2.1)     10.3 (2.2)
  TPMT heterozygous range                               2 (10%)        5 (12%)
  Previous systemic therapy or phototherapy for eczema  16 (80%)       30 (73%)
  Hay fever                                             14 (70%)       29 (71%)
  Asthma                                                13 (65%)       27 (66%)
  Raised serum IgE                                      15/15 (100%)   34/35 (97%)
Baseline assessment of endpoints
  Disease activity (SASSAD)                             32.7 (8.9)     32.3 (13.2)
  Body area involved                                    58.3 (17.9)    51.0 (21.0)
  Patient-assessed itch score                           5.7 (1.8)      5.4 (2.1)
  Patient-assessed loss-of-sleep score                  4.9 (2.6)      4.4 (2.5)
  Quality of life score (DLQI)                          9.4 (6.1)      9.7 (5.0)
------------------------------------------------------------------------------------
Data are: mean (SD) or number (%)

Figure 11.5 Demographic and clinical characteristics together with baseline assessments of disease in patients with moderate-to-severe eczema. Source: Based on Meggitt, Gray and Reynolds (2006).

participants in the trial of Meggitt, Gray and Reynolds (2006, Table 1). This tabulation extends beyond the demographics of age and gender to details concerning markers such as genetic polymorphism in thiopurine methyltransferase (TPMT) and serum immunoglobulin E (IgE), concomitant disease (hay fever and asthma), previous treatment for eczema, and baseline assessments of several endpoint measures: disease activity (SASSAD), body area involved and itch score. The authors also indicate that more complete information is available in an accompanying web-based table which has been published online with the corresponding article. Whilst not a critical point, it is important to remember that the purpose of such a table is not to estimate, for example, the mean age of the participants within the placebo group, but merely to describe them. Thus, the range of ages from youngest to oldest, rather than the standard deviation (SD), is a more appropriate summary here. This equally applies to the other continuous variables within the table such as TPMT activity and patient-assessed itch score. Although this is a randomised trial, it is clear from Figure 11.5 that the characteristics of those in the placebo and azathioprine groups are not exactly identical, although there


11 REPORTING

are no major disparities. However, what happens if we carry out statistical tests to see whether disparities actually occur? On this point Fayers and King (2008) state:

… we already know what the answer must be. Because the treatments are allocated by randomisation, any differences in the baseline characteristics must be purely due to chance. Even when randomisation is done properly, we expect approximately 5% of the characteristics tested to have P-values that are less than 0.05, and we expect 1% of characteristics to be significant with P < 0.01. In other words, if there were 20 baseline characteristics being explored, on average we would expect, purely by chance, that one characteristic would be significant with P < 0.05.
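The arithmetic behind this remark can be sketched in a few lines of Python. The function names are illustrative only, and the probability calculation assumes, for simplicity, that the baseline tests are independent, which will rarely hold exactly for correlated characteristics.

```python
def expected_significant(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' baseline tests under proper randomisation."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more 'significant' tests, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 20 baseline characteristics, as in the quotation
print(expected_significant(20))          # about 1 characteristic on average
print(round(prob_at_least_one(20), 2))   # roughly a two-in-three chance

# 12 baseline characteristics
print(round(prob_at_least_one(12), 2))
```

Under these assumptions, a table of 20 baseline tests is more likely than not to contain at least one P < 0.05 even when the randomisation is entirely sound.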

Significance tests of baseline imbalances are only useful as a means of testing whether there may have been a violation of the randomisation procedure. Despite these remarks, D’Haens, Baert, van Assche, et al. (2008, Table 1) in their trial in patients with newly diagnosed Crohn’s disease inappropriately conducted 12 statistical tests of baseline characteristics, which are illustrated in Figure 11.6, although none turned out to be statistically significant at the 5% level. In contrast, Meggitt, Gray and Reynolds (2006, p. 842) correctly noted:

By chance, there was a sex imbalance between groups at baseline (table 1, webtable 2) ….

It is important to remember that many of the variables in the first table of a report are purely descriptive in nature; but any that are identified as major prognostic features for outcome at the design stage of the trial should be used to verify whether the estimate of treatment effect changes substantially when the analysis is adjusted for these using regression techniques. Any baseline characteristics that are of major prognostic value should be used as covariates, or as stratification factors if appropriate, in the analysis, irrespective of whether or not any imbalance in these measures between the intervention arms is present.

11.6.3 Endpoints

Presentation of statistical material satisfactory? Confidence intervals given for the main results? Conclusions drawn from the statistical analysis justified?

It is essential that the main focus of any report on the outcome of a randomised clinical trial is the relative efficacy of the alternative interventions under test with respect to the primary endpoint(s) specified in the protocol. Thus, the intention of the eventual analysis is to enable a statement such as that of Girard, Kress, Fuchs, et al. (2008, p. 126) to be made. They summarise in the structured abstract of their paper:

Interpretation: Our results suggest that a wakeup and breathe control protocol that pairs daily spontaneous awakening trials (ie interruption of sedatives) with daily spontaneous breathing trials results in better outcome for mechanically ventilated patients in intensive care than current standard approaches and should become routine practice.


                                          Early combined       Conventional
                                          immunosuppression    management
                                          (n = 65)             (n = 64)          p value
------------------------------------------------------------------------------------------
Sex (female)                              43 (66·2%)           37 (57·8%)        0·33*
Race (white)                              64 (98·5%)           61 (95·3%)        0·37†
Age (years)                               30·0 (11·8)          28·7 (10·9)       0·50‡
Weeks from diagnosis to treatment         2·0 (1·0–5·0)        2·5 (1·0–11·0)    0·65†
Height (m)                                1·71 (0·09)          1·71 (0·10)       0·93‡
Weight (kg)                               63·1 (13·4)          62·5 (12·1)       0·82‡
Smoking                                                                          0·18*
  Current                                 28 (43·1%)           23 (35·9%)
  Former                                  8 (12·3%)            16 (25·0%)
  Never                                   29 (44·6%)           25 (39·1%)
Mesalazine use                            3 (4·6%)             2 (3·1%)          1·00†
Disease location                                                                 0·90*
  Small bowel                             14 (21·5%)           15 (23·4%)
  Ileocolitis                             31 (47·7%)           28 (43·8%)
  Colitis                                 20 (30·8%)           21 (32·8%)
CDAI score§                               330 (92)             306 (80)          0·12†
IBDQ¶                                     122 (33)             136 (28)          0·11†
C-reactive protein concentration (mg/L)   19 (5–75)            25 (8–59)         0·22†
------------------------------------------------------------------------------------------
Data are number (%), mean (SD), or median (IQR) unless otherwise specified. *χ2 test for dichotomous variables. †Student’s test for continuous variables. ‡Fisher’s exact test. §Crohn’s Disease Activity Index scores range from 0 to 600; higher scores indicate greater disease activity. ¶Inflammatory Bowel Disease Questionnaire scores range from 32 to 224; higher scores indicate better health-related quality of life.

Figure 11.6 Table of baseline characteristics illustrating the inappropriate use of statistical significance tests for comparing groups which have been randomised. Source: D’Haens, Baert, van Assche, et al. (2008). © Elsevier

In describing their analysis, Meggitt, Gray and Reynolds (2006, Table 2) tabulate the results obtained from seven different endpoint variables which we summarise in Figure 11.7. This table does not, for example, report the simple difference between mean reduction in disease activity (SASSAD), but those obtained after adjustment for the four variables in the minimisation algorithm used for the dynamic method of treatment allocation (see Section 5.3.3). However, this detail is rather lost as an obscure footnote to their table. Also, differences between groups, after such adjustments, are not necessarily


                                                   Treatment
                                                   Placebo   Azathioprine   Difference (95% CI)
-------------------------------------------------------------------------------------------------
Number of patients                                 20        41
Reduction in disease activity (SASSAD)             6.6       12.0           5.4 (1.4 to 9.3)
Reduction in % body area involved                  14.6      25.8           11.2 (1.6 to 20.7)
Reduction in itch score                            1.0       2.4            1.4 (0.1 to 2.7)
Reduction in loss-of-sleep score                   1.2       2.5            1.3 (−0.1 to 2.6)
Improvement in quality of life (DLQI)              2.4       5.9            3.5 (0.3 to 6.7)
Reduction in soluble CD30                          −12.6     3.3            16.0 (−0.3 to 32.3)
Median reduction in combined moderate/potent
  topical steroid use (g per month)                12.5      22.5           4.8 (−14.0 to 39.0)
-------------------------------------------------------------------------------------------------

Figure 11.7 Drug efficacy at 12 weeks in patients with moderate-to-severe eczema treated by either placebo or azathioprine. Source: Data from Meggitt, Gray and Reynolds (2006).
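As an informal check on Figure 11.7, the following Python sketch recomputes the simple difference (azathioprine minus placebo) for each endpoint and lists those endpoints where it departs from the difference quoted in the figure. The endpoint labels are shortened, and the 0.05 tolerance is an assumption chosen to allow for rounding to one decimal place.

```python
# (placebo, azathioprine, difference quoted in Figure 11.7)
FIGURE_11_7 = {
    "SASSAD": (6.6, 12.0, 5.4),
    "% body area involved": (14.6, 25.8, 11.2),
    "itch score": (1.0, 2.4, 1.4),
    "loss-of-sleep score": (1.2, 2.5, 1.3),
    "DLQI": (2.4, 5.9, 3.5),
    "soluble CD30": (-12.6, 3.3, 16.0),
    "topical steroid use": (12.5, 22.5, 4.8),
}


def simple_difference(placebo: float, azathioprine: float) -> float:
    """Unadjusted difference between the two group summaries."""
    return round(azathioprine - placebo, 1)


mismatches = [
    name
    for name, (placebo, aza, quoted) in FIGURE_11_7.items()
    if abs(simple_difference(placebo, aza) - quoted) > 0.05
]

for name in mismatches:
    placebo, aza, quoted = FIGURE_11_7[name]
    print(f"{name}: simple {simple_difference(placebo, aza)} vs quoted {quoted}")
```

Only the soluble CD30 and topical steroid rows are flagged, with simple differences of 15.9 and 10.0 respectively.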

easy for the reader to interpret. So, it is usual to first give the unadjusted values and then point out whether or not these differences, once adjusted, substantially alter the interpretation of the trial results. In Figure 11.7, the simple differences between the group values all appear to equal the adjusted differences quoted, except for reduction in soluble CD30 (15.9 which is trivially different from 16.0) and median reduction in combined moderate/potent topical steroid use (10.0 which is very different from 4.8). The format of Equation (8.19) of Section 8.6.5 suggests a more satisfactory analysis that is available for this design, using regression techniques to include the baseline (pretreatment) assessment of SASSAD as a covariate, c, and the SASSAD values at 12 weeks as the continuous dependent variable, y, replacing log(HR). Nevertheless, Figure 11.7 clearly sets out the values of the estimated treatment difference for each endpoint variable concerned and quotes the corresponding 95% confidence intervals. A further useful addition would be the corresponding p-values for each comparison. Advice for the conduct and reporting of adjusted analyses is given, in the spirit of CONSORT, by Yu, Chan, Hopewell, et al. (2010) who state, for example, that details of the rationale, statistical methods used, and clarification of the choice of variables used for adjustment should be included in the trial report.

11.6.4 Adverse events

Adverse effects of interventions reported?

In some situations, one or more of the interventions may raise concerns about safety issues, which then have to be balanced against its merits. For example, in the trial described in Example 3.7, relating to the use of alternative mattresses to reduce pressure

                        Anacetrapib (mg)
                        Placebo
                        0       10      40      150     300
-------------------------------------------------------------
Number of patients      10      10      10      10      10
Nausea                  1               1               1
Headache                3       3       3       3       1
Diarrhoea                                       1       1
Pain in extremity       1
Abdominal pain          1
Dizziness               1       2                       1
-------------------------------------------------------------

Figure 11.8 Incidence of adverse events in patients with dyslipidaemia. Source: Data from Krishna, Anderson, Bergman, et al. (2007).

sores, there was a concern that one mattress type (with the greater physical depth) might be associated with more patients falling from the bed. Thus, the corresponding protocol will have identified the ‘adverse events’ that should be documented and these should be reported in the trial-associated publications. In a trial involving 50 patients with dyslipidaemia, Krishna, Anderson, Bergman, et al. (2007) recorded 28 different types of adverse events and some of these are listed in Figure 11.8 by the five doses of anacetrapib received, which varied from 0 mg (Placebo) to 300 mg. In their article, the placebo results were placed as the final column of their table, but are placed here in the column to the left of that for 10 mg, to facilitate a visual inspection of trends over increasing dose. In fact, no patterns seem to be present, but this is quite a small trial and clear patterns may not be expected. However, headache, the most common adverse event affecting 13/50 (26%) of patients, is least common in those receiving the highest dose of 300 mg. The authors conclude, without any statistical comparisons, that:

Anacetrapib was generally well tolerated … in patients with dyslipidaemia. There were no serious adverse events and no discontinuations due to clinical or laboratory adverse experiences. … All adverse experiences were transient and resolved without treatment.
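The summary figures quoted for Figure 11.8 can be reproduced with a short Python sketch. Blank cells in the figure are taken here to mean zero events, which is an assumption rather than something stated in the source, and the function name is illustrative.

```python
DOSES_MG = [0, 10, 40, 150, 300]   # 0 mg is placebo
PATIENTS_PER_GROUP = 10

# Counts by dose group, in the order of DOSES_MG (blank cells assumed to be 0)
ADVERSE_EVENTS = {
    "Nausea": [1, 0, 1, 0, 1],
    "Headache": [3, 3, 3, 3, 1],
    "Diarrhoea": [0, 0, 0, 1, 1],
    "Pain in extremity": [1, 0, 0, 0, 0],
    "Abdominal pain": [1, 0, 0, 0, 0],
    "Dizziness": [1, 2, 0, 0, 1],
}


def overall_incidence(counts, n_per_group=PATIENTS_PER_GROUP):
    """Return (patients affected, patients in trial, percentage affected)."""
    total = sum(counts)
    n = n_per_group * len(counts)
    return total, n, 100 * total / n


for event, counts in ADVERSE_EVENTS.items():
    total, n, pct = overall_incidence(counts)
    print(f"{event:18s} {total:2d}/{n} ({pct:.0f}%)  by dose: {counts}")

most_common = max(ADVERSE_EVENTS, key=lambda e: sum(ADVERSE_EVENTS[e]))
```

Listing the counts in dose order, as the authors of this book do in Figure 11.8, makes it easier to look for any trend with increasing dose.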

In contrast, when reporting the adverse events given in Figure 11.9, Meggitt, Gray and Reynolds (2006, Table 4) do make a statistical comparison but not between the two randomised treatment groups. Neither are the details of this analysis described in the Statistical Methods section, but only added as a footnote to their table. Thus, they compare TPMT activity amongst the groups experiencing different levels of nausea. It is unclear whether those from both the Azathioprine and Placebo groups are included or whether it is the standard deviation or the standard error that is within the brackets [.], no confidence interval is quoted, and the form of analysis by grouping nausea into two categories (None and Mild versus Moderate and Severe) is less than optimal in any event. Considerable caution is required as to how one should interpret the apparently ad hoc analysis of their footnote.

                                                                 Treatment group
                                                                 Placebo (%)   Azathioprine (%)
-------------------------------------------------------------------------------------------------
Number of patients                                               20            41
Adverse events
  Nausea*
    None                                                         15 (75)       20 (49)
    Mild                                                         5 (25)        10 (24)
    Moderate (dose-limiting)                                                   7 (17)
    Severe                                                                     4 (10)
  Headaches                                                      3 (15)        5 (12)
  Abdominal pain                                                 2 (10)        4 (10)
  Light-headedness                                               1 (5)         3 (7)
  Malaise                                                        2 (10)        1 (2)
  Folliculitis                                                   2 (10)        3 (7)
  Respiratory tract infection
    Lower                                                                      2 (5)
    Upper                                                        1 (5)         2 (5)
Abnormalities in laboratory measures
  > 1 episode neutropenia (1–2 × 10^9/L)                                       2 (5)
  > 1 episode mild lymphopenia (1–1.5 × 10^9/L)                  6 (30)        18 (44)
  > 1 episode moderate lymphopenia (1–1.5 × 10^9/L)              4 (20)        10 (24)
  Alanine transaminase increase > 15% above upper normal limit   2 (10)        4 (10)
  Alanine transaminase increase > 50% above upper normal limit   1 (5)         2 (5)
-------------------------------------------------------------------------------------------------
*TPMT activity was not significantly different (p = 0.5) between participants with no nausea or mild nausea, and moderate or severe nausea (10.3 [2.3] vs. 10.7 [2.1] nmol/h/mL RBC).

Figure 11.9 Adverse events and laboratory abnormalities reported in patients with moderate-to-severe atopic eczema. Source: Based on Meggitt, Gray and Reynolds (2006).

11.6.5 Graphics

One method of presentation of trial results that should not be overlooked is pictorial. If results can be displayed graphically, and also show the actual and individual trial data that has been collected, then they can be particularly informative. Thus, in the dot plot of Example 1.7 Figure 1.1, which is based on the results of the trial by Meggitt, Gray and Reynolds (2006, Figure 1A), it is evident that there has been a reduction in disease activity (scores greater than 0) as assessed by SASSAD for the vast

[Dot plot. Vertical axis: number of stools per episode of diarrhoea (0 to 35). Horizontal axis: LT patch, Placebo.]

Cumulative stools from individual episodes in LT-patch recipients versus placebo recipients, including analysis population as well as two individuals who had more than one episode (two and three episodes, respectively). Solid spot: moderate to severe episode. White spot: mild episode. Solid bar: mean number of diarrhoea stools per group. Note that placebo-to-vaccine rate is 1 to 1.88.

Figure 11.10 Dot plots of severity of diarrhoeal episodes and numbers of stools by treatment group. Source: Frech, DuPont, Bourgeois, et al. (2008). © Elsevier

majority of the patients following treatment for their eczema. Further, those receiving Azathioprine tend to show greater improvement than their counterparts receiving Placebo. However, it is also very clear that there is considerable overlap between the reductions achieved in the two groups. Despite the usefulness of Figure 1.1, an even more informative presentation would have been a dot plot, one for each treatment group, of the actual, rather than the percentage, change in SASSAD scores at baseline and at 12 weeks, with the corresponding individual patient values joined. Such plots might indicate, for example, consistently lower values of SASSAD in all patients over the period but a steeper drop amongst those receiving Azathioprine.
The very informative plot of Figure 11.10 is provided in the report by Frech, DuPont, Bourgeois, et al. (2008) and gives a clear indication of how a patch containing heat-labile toxin (LT) reduces problems associated with travellers’ diarrhoea. For those patients with diarrhoeal episodes, it illustrates the variation in the number of stools per episode with the corresponding mean by treatment group, as well as the severity of the individual episodes.
Graphical representations are commonplace when reporting trials with a time-to-event endpoint and are particularly useful and informative. These graphs usually show the Kaplan–Meier estimates of the corresponding survival curves. An example is

[Kaplan–Meier survival curves. Vertical axis: Patients alive (%), 0 to 100. Horizontal axis: Days after randomisation, 0 to 360.]

                        Patients   Events
SAT plus SBT            167        74
Usual care plus SBT     168        97

Patients at risk
Days after randomisation   0     60    120   180   240   300   360
SAT plus SBT               167   110   96    92    91    86    76
Usual care plus SBT        167   85    73    67    66    65    59

Figure 11.11 Survival at 1-year of mechanically ventilated patients in intensive care. Source: Girard, Kress, Fuchs, et al. (2008). © Elsevier

Figure 11.11 which is taken from the trial of Girard, Kress, Fuchs, et al. (2008). Although there is a slight contradiction in the numbers allocated between those given ‘at risk’ and within the graphics panel, the figure clearly shows an improved survival in these mechanically ventilated patients of the ‘SAT plus SBT’ regime over those receiving ‘Usual care plus SBT’. It gives the number of patients randomised to each intervention group and indicates how the number at risk declines within each intervention group as the year following randomisation progresses. A useful addition to the graphics may have been a text box indicating the value of the hazard ratio (HR) of 0.68, the 95% confidence interval (0.50–0.92) and the corresponding p-value (0.01).
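For orientation only, the crude proportions of deaths implied by the patient and event counts printed within Figure 11.11 can be computed as in the Python sketch below. These crude figures ignore the timing of deaths and any censoring, so they are not a substitute for the Kaplan–Meier estimates or for the hazard ratio quoted above; the function name is illustrative.

```python
# Patients and events as printed within the graphics panel of Figure 11.11
ARMS = {
    "SAT plus SBT": {"patients": 167, "events": 74},
    "Usual care plus SBT": {"patients": 168, "events": 97},
}


def crude_death_proportion(arm: dict) -> float:
    """Deaths divided by patients randomised, ignoring follow-up time."""
    return arm["events"] / arm["patients"]


p_sat = crude_death_proportion(ARMS["SAT plus SBT"])
p_usual = crude_death_proportion(ARMS["Usual care plus SBT"])
crude_risk_ratio = p_sat / p_usual   # not the same quantity as the HR

print(f"SAT plus SBT:        {100 * p_sat:.1f}% died")
print(f"Usual care plus SBT: {100 * p_usual:.1f}% died")
print(f"Crude risk ratio:    {crude_risk_ratio:.2f}")
```

The crude risk ratio is closer to 1 than the quoted HR of 0.68, which is to be expected since a ratio of cumulative risks is a different quantity from a ratio of hazards.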

11.7 When things go wrong

Even in the most carefully planned and conducted clinical trials things can go wrong. Some of these may be mistakes made by the design team in the original concept, but others may arise through unforeseen circumstances. Whatever their importance, the wisest thing for the writing team to do is to admit them, explain how they have arisen and discuss how they might have influenced the conclusions drawn. The worst one can do (and this is essentially dishonest in any event) is to try to camouflage such occurrences, perhaps hoping the referees will not spot them if they cannot be entirely concealed. In Section 11.4.6, we gave one example where things went wrong. In that case, the (computerised) minimisation randomisation process in the trial of Kahn, Fleischhacker, Boter, et al. (2008) malfunctioned resulting in a lack of balance in the numbers

11.8 CONCLUSIONS


randomised to the five intervention groups. Nevertheless, their paper was accepted for publication by a reputable journal, and so the referees and editors must have judged that the technical problem had not compromised the reliability of the trial conclusions.
A potentially more serious problem arose in the randomised double-blind trial of the use of tamoxifen in patients with advanced (inoperable) hepatocellular carcinoma conducted by Chow, Tai, Tan, et al. (2002). The possible problem with the placebo and tamoxifen tablets was only discovered after the analysis was complete and the results had indicated a trend opposite to the one anticipated by the design. Thus, the results appeared to show that high dose tamoxifen carried an adverse survival outcome compared to placebo, with an intermediate dose giving an intermediate survival outcome. This raised the possibility that the labelling for the double-blind code had become switched in some way. In fact, this was not the case but it was found, after crushing and examining unused but still packaged tablets from the different centres involved, that somewhere in the production-to-packaging process some batches of placebo and active had been switched. The investigation led by a senior medical statistician concluded that, despite this contamination, the results if anything would underestimate the adverse effect of high dose tamoxifen. All this was explained in the submitted paper and the article was accepted for publication in a high impact factor journal.

11.8 Conclusions

We have stressed on several occasions that an important strategy for the trial team is to anticipate what the chosen journal might expect in general terms with respect to many sections of the article which is to be submitted. The major requirement is to summarise the key results and consider the consequences for clinical care and/or research. It is likely to be easy to describe the results of the trial if it addresses what many would agree is an important question and if the outcome provides a clear and unequivocal answer to this. Summarising the results may be more of a challenge when there still remains considerable uncertainty surrounding the conclusions, and this can happen even when a trial is well planned and executed but the results are contrary to expectations. Although the trial protocol will have reviewed the current state of knowledge at the time of planning the trial, this must be updated here with any developments that have been made during the period that the trial has been conducted. In particular, the results should be compared with those from any related clinical trials that may have been published in this interim. It is also important to consider any shortcomings, for example perhaps a larger proportion of patients were lost to follow-up than had been anticipated; however, care should be taken to ensure that such limitations are reviewed in a balanced manner and are not overemphasised to the detriment of the trial’s importance. In many circumstances, the trial being reported will raise further questions perhaps requiring subsequent trials. An indication of what (if any) these might be would be valuable here.

11.9 Guidelines

11.9.1 General

General guidelines for the structure of reports following the completion of a clinical trial are available. These essentially describe the very detailed requirements necessary to support, if appropriate, the documentation needed for regulatory approval of a test product for subsequent clinical use. Nevertheless, even for a trial not seeking such approval, they provide a useful checklist of key features that need to be included in any published clinical trial report.

ICH E3 (1996). Structure and Content of Clinical Study Reports. CPMP/ICH/137/95. http://www.ich.org.

11.9.2 Editorial

Many journals provide online versions of guidelines for authors.

Journal of Clinical Oncology: http://jco.ascopubs.org/, contains sections on: conflicts of interest; authorship contributions; clinical trial registration; and statistical guidelines.
British Medical Journal (2018): bmj.com/about-bmj/resources-authors contains guidance for authors when preparing their submission.

11.9.3 CONSORT

The original CONSORT statement was intentionally generic and did not consider in detail specific types of trials. However, extensions to the CONSORT statement have been developed for specific trial designs and topics such as: non-inferiority and equivalence, cluster randomised designs, reporting of abstracts, data on harm, herbal medicine interventions, non-pharmacological interventions, pragmatic trials, acupuncture, social and psychological interventions, crossover, pilot and feasibility, and others. Up-to-date guidelines can be found on the CONSORT website www.consort-statement.org. In particular, the ‘CONSORT 2010 checklist of information to include when reporting a randomised trial’ is given. Useful references are:

Boutron I, Moher D, Altman DG, Schulz KF and Ravaud P (2008). Extending the CONSORT statement to randomized trials of nonpharmacologic treatment: explanation and elaboration. Annals of Internal Medicine, 148, 295–309.
Campbell MK, Piaggio G, Elbourne DR and Altman DG (2012). CONSORT 2010 statement: extension to cluster randomised trials. BMJ, 345, e5661.
Gagnier JJ, Boon H, Rochon P, Moher D, Barnes J and Bombardier C (2006). Reporting randomized, controlled trials of herbal interventions: an elaborated CONSORT statement. Annals of Internal Medicine, 144, 364–367.
Grant S, Mayo-Wilson E, Montgomery P, Macdonald G, Michie S, Hopewell S and Moher D (2018). CONSORT-SPI 2018 explanation and elaboration: guidance for reporting social and psychological intervention trials. Trials, 19, 406.

11.9 GUIDELINES


Hopewell S, Clarke M, Moher D, Wager E, Middleton P and Altman DG (2008). CONSORT for reporting randomised trials in journal and conference abstracts. Lancet, 371, 281–283.
ICH E3 (1996). Structure and Content of Clinical Study Reports. CPMP/ICH/137/95. http://www.ich.org.
Ioannidis JP, Evans SJ, Gøtzsche PC, O’Neill RT, Altman DG, Schulz K and Moher D (2004). Better reporting of harms in randomized trials: an extension of the CONSORT statement. Annals of Internal Medicine, 141, 781–788.
Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M and Altman DG (2010). CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ, 340, c869. https://doi.org/10.1136/bmj.c869.
Moher D, Schulz KF and Altman DG (2001). The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet, 357, 1191–1194.
Piaggio G, Elbourne DR, Pocock SJ, Evans SJW and Altman DG (2012). Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. Journal of the American Medical Association, 308, 2594–2604.
Zwarenstein M, Treweek S, Gagnier JJ, Altman DG, Tunis S, Haynes B, Oxman AD and Moher D (2008). Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ, 337, 1223–1226.

11.9.4 Statistical standards

Complementing the CONSORT statement, which emphasises the need to describe the participant flow through the trial process, guidance on basic statistical reporting for articles published in clinical medical journals is given by Lang and Altman (2013). This can be accessed, and reprinted without charge, through the BMJ website: www.bmj.com. The authors also refer to earlier guidelines concerning statistical reporting.
Guidelines are clearly useful for those designing trials who will eventually become the authors of any potential publication and are thereby exposed to the peer review system of the intended journal. To give one example of a specific requirement relating to randomised controlled trials, guidance specifies that confidence intervals (on the treatment effect size) are to be given for the main results, supplementing the p-value from the associated hypothesis tests.
A number of checklists to aid critical reading of randomised trial reports are given by the EQUATOR network (http://www.equator-network.org) and by CASP (http://casp-uk.net/casp-tools-checklists).

PART II

Adaptions of the Basic Design

CHAPTER 12

More Than Two Interventions

In this chapter, extensions of the basic parallel two-group randomised trial are considered. These include parallel designs of three or more groups, including those comparing each of several interventions with a standard which may be a placebo, those comprising different doses of the same compound, and those with no structure in the groups to be compared. The specific situation of the factorial design in which more than one type of intervention comparison can be included in the same trial is also described. Methods of analysis, and for estimating the numbers of subjects to be recruited to such trials, are outlined.

12.1 Introduction

In a design of g > 2 groups, sometimes the interventions being compared are unrelated and the design may be described as unstructured. In other situations, there may be partial structure in that a specific comparison of a subset of the g therapeutic options may be of interest. Structured designs include cases where there are g − 1 therapies under test, but each one is to be compared to the current standard approach to treatment. In others, there may be the possibility of investigating a dose–response relationship. When more than one aspect of therapy is under consideration, for example a choice of surgical approach together with a choice of anaesthesia, a trial may include these two aspects in a so-called factorial design. In general, these extended designs are more difficult to conduct, for example by needing a more complex protocol and by making the informed consent process more involved and lengthier. Nevertheless, these disadvantages may be compensated by the additional insights that the results from such designs may bring.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.

12.2 Unstructured comparisons

12.2.1 Hypotheses

An unstructured trial design of g > 2 groups may wish to test totally different approaches to the treatment of a disease or condition and for which no standard approach has been established. In this case, assuming the outcome will be summarised by a response rate for each group, the null hypothesis is $H_0: \pi_A = \pi_B = \cdots = \pi_g$. However, there is a whole range of possible alternative 2-sided hypotheses. For example, for g = 3 these are $H_{Alt1}: \pi_A \ne \pi_B \ne \pi_C$, $H_{Alt2}: \pi_A = \pi_B \ne \pi_C$, $H_{Alt3}: \pi_A = \pi_C \ne \pi_B$, and $H_{Alt4}: \pi_A \ne \pi_B = \pi_C$. Despite these four alternatives, the context of the trial under consideration may suggest that only one or two of these is appropriate.

12.2.2 Analysis

To test any one of the hypotheses above, the regression term $\beta_{Treat}x$ of (8.13) may have to be replaced by an appropriate dummy variable format depending on the form of the alternative hypothesis, $H_{Alt}$, concerned. For $H_{Alt1}: \pi_A \ne \pi_B \ne \pi_C$, two dummy variables are required, perhaps $(v_A = 1, v_B = 0)$ if intervention A is allocated, $(v_A = 0, v_B = 1)$ if B is allocated, and automatically C is $(v_A = 0, v_B = 0)$ as described in Figure 8.15. This results in regression analysis for the intervention having the format

$$y = \beta_0 + \beta_{TreatA} v_A + \beta_{TreatB} v_B \qquad (12.1)$$

Although not strictly necessary, it may be convenient to specify a dummy variable for all intervention groups, in our example $v_C$. This then describes interventions A, B and C by the triples (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively. On the other hand, $H_{Alt2}: \pi_A = \pi_B \ne \pi_C$ suggests that interventions A and B will act in the same way but may differ from C. This is now a two-group comparison, with A and B coded $(v_{AB} = 1, v_C = 0)$ and C coded $(v_{AB} = 0, v_C = 1)$, with associated model $y = \beta_0 + \beta_{TreatAB} v_{AB}$. Had the three dummy variables been created then, for this situation, the two groups are identified by A and B $(v_A = 1, v_B = 1, v_C = 0)$ and C $(v_A = 0, v_B = 0, v_C = 1)$. However, as this is now a binary comparison, the structure of the model remains as in Equation (8.13). In a similar way, testing hypotheses $H_{Alt3}$ and $H_{Alt4}$ also involves binary comparisons.
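The dummy variable coding described here can be sketched in Python as follows. The function names and the coefficient values in the final lines are hypothetical, chosen only to show how the linear predictor of Equation (12.1) is assembled; they are not estimates from any trial.

```python
def dummy_code(group: str, levels: list, reference: str) -> dict:
    """One 0/1 indicator per level other than the reference (baseline) group."""
    return {f"v{level}": int(group == level) for level in levels if level != reference}


def pooled_code(group: str, pooled: set) -> dict:
    """Single indicator for a binary comparison, for example A and B versus C."""
    return {"vAB": int(group in pooled)}


LEVELS = ["A", "B", "C"]

# HAlt1: two dummy variables, with C as the baseline
coding = {g: dummy_code(g, LEVELS, reference="C") for g in LEVELS}
# {'A': {'vA': 1, 'vB': 0}, 'B': {'vA': 0, 'vB': 1}, 'C': {'vA': 0, 'vB': 0}}

# HAlt2: A and B pooled, compared with C
pooled = {g: pooled_code(g, {"A", "B"}) for g in LEVELS}

# Linear predictor of Equation (12.1) with hypothetical coefficients
beta0, beta_treat_a, beta_treat_b = 0.5, -0.2, 0.1


def linear_predictor(group: str) -> float:
    v = coding[group]
    return beta0 + beta_treat_a * v["vA"] + beta_treat_b * v["vB"]
```

With C as the baseline, the linear predictor for C reduces to the intercept, and each treatment coefficient measures the departure of A or B from C.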

12.2 UNSTRUCTURED COMPARISONS

Example 12.1 Postmenopausal women with early breast cancer

The ATAC (Arimidex, Tamoxifen Alone or in Combination) Trialists’ Group (2002) trial compared Anastrozole (Anas), Tamoxifen (Tam) and both combined (Comb) for the adjuvant treatment of postmenopausal women with early breast cancer. One aspect of their trial is the number of serious adverse events reported (r) from the women (n) in each treatment group. The corresponding proportions (p = r/n) are summarised in Figure 12.1a. To test the null hypothesis that the three population proportions are equal against the alternative that they each differ, the dummy variables of Figure 12.1b are created, followed by the logistic regression command (blogit r n Anas Tam Comb, or). The computer output points out that not all three dummy variables are required for the analysis, so omits (Comb), the last in the list. In effect, it regards the Comb group as the baseline and calculates the odds ratio (OR) for the Anas and Tam treatments relative to that. Thus, as compared to taking Comb, those women taking Tam are at slightly greater risk while those taking Anas are at lower risk. The overall test of the null hypothesis is based on a Chi2 (often denoted χ2) test with the number of degrees of freedom (df) equal to the number of groups, g, being compared minus 1. In our example, g = 3, so df = 2 and Chi2 = 5.60, from which the computer provides the corresponding p-value = 0.061. As this is greater than 0.05, the overall (three-component) null hypothesis is not rejected in this case. Should the investigator wish to express the OR results relative to Tam (the standard treatment in this case), then (Tam) should be placed as the last dummy in the logistic regression command, which becomes (blogit r n Anas Comb Tam, or). In this format, the output indicates slightly lower risks of serious adverse events with Anas and with Comb.
To test the null hypothesis against the alternative that Tam and Anas groups are equally at risk, but differ from Comb, Figure 12.1c uses the command (blogit r n Comb, or). In this case, g = 2, so df = 1 and Chi2 = 1.22, from which the pvalue = 0.27. As this is greater than 0.05, the null hypothesis is not rejected although the estimated OR = 1.06 indicates a greater risk in this group. Alternatively, if the combined Anas and Tam groups are to be the single baseline, the command (gen AT=Anas+Tam) provides the dummy variable for this combined group. This corresponding command results in OR = 0.94, which is the inverse of the 1.06 above.


12 MORE THAN TWO INTERVENTIONS

(a)
list Treat r n p, noobs
+-----------------------------+
| Treat     r      N      p   |
|-----------------------------|
| Anas    685   3092   0.222  |
| Tam     755   3094   0.244  |
| Comb    753   3097   0.243  |
+-----------------------------+

(b) HAlt1: πAnas ≠ πTam ≠ πComb
list Treat Anas Tam Comb, noobs
+--------------------------+
| Treat   Anas  Tam  Comb  |
|--------------------------|
| Anas      1    0     0   |
| Tam       0    1     0   |
| Comb      0    0     1   |
+--------------------------+

blogit r n Anas Tam Comb, or
Note: Comb omitted because of collinearity
Logistic regression for grouped data
Number of obs = 9,283, Chi2(2) = 5.60, Prob > chi2 = 0.061
----------------------------------------------------------------
Treat   |    OR       SE       z     P>|z|   (95% CI)
--------+-------------------------------------------------------
Anas    |  0.8859   0.0534   -2.01   0.044   0.7872 to 0.9969
Tam     |  1.0048   0.0595    0.08   0.94    0.8947 to 1.1285
Comb    |  1 (omitted)
----------------------------------------------------------------

blogit r n Anas Comb Tam, or
----------------------------------------------------------------
Treat   |    OR       SE       z     P>|z|   (95% CI)
--------+-------------------------------------------------------
Anas    |  0.8817   0.0531   -2.09   0.036   0.7835 to 0.9921
Comb    |  0.9952   0.0589   -0.08   0.94    0.8862 to 1.1177
Tam     |  1 (omitted)
----------------------------------------------------------------

(c) HAlt2: πAnas = πTam ≠ πComb
blogit r n Comb, or
Logistic regression for grouped data
Number of obs = 9,283, Chi2(1) = 1.22, Prob > chi2 = 0.27
----------------------------------------------------------------
Treat   |    OR       SE       z     P>|z|   (95% CI)
--------+-------------------------------------------------------
Comb    |  1.0588   0.0546    1.11   0.27    0.9570 to 1.1714
----------------------------------------------------------------

gen AT=Anas+Tam
blogit r n AT, or
----------------------------------------------------------------
Treat   |    OR       SE       z     P>|z|   (95% CI)
--------+-------------------------------------------------------
AT      |  0.9445   0.0487   -1.11   0.27    0.8537 to 1.0450
----------------------------------------------------------------

Figure 12.1 Reported serious adverse events in postmenopausal women with early breast cancer by treatment group. Source: Data from The ATAC (Arimidex, Tamoxifen Alone or in Combination) Trialists’ Group (2002).
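Because each model in Figure 12.1 has one parameter per group being compared, the odds ratios and their Wald confidence intervals can also be recovered directly from the counts of panel (a). The following is a minimal Python sketch of that arithmetic, not the Stata computation itself; the function name odds_ratio and the use of z = 1.96 for the 95% interval are our own assumptions for illustration.

```python
from math import exp, log, sqrt


def odds_ratio(r1, n1, r0, n0, z=1.96):
    """Odds ratio of group 1 relative to baseline group 0 with a Wald CI.

    r = number with a serious adverse event, n = number of women.
    Hypothetical helper written for this illustration only.
    """
    or_hat = (r1 / (n1 - r1)) / (r0 / (n0 - r0))
    # Standard error of log(OR) from the four cell counts
    se_log = sqrt(1 / r1 + 1 / (n1 - r1) + 1 / r0 + 1 / (n0 - r0))
    lower = exp(log(or_hat) - z * se_log)
    upper = exp(log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Counts (r, n) from Figure 12.1a
counts = {"Anas": (685, 3092), "Tam": (755, 3094), "Comb": (753, 3097)}

# Panel (b): Anas and Tam relative to Comb
anas_vs_comb = odds_ratio(*counts["Anas"], *counts["Comb"])
tam_vs_comb = odds_ratio(*counts["Tam"], *counts["Comb"])

# Panel (c): Comb relative to the pooled Anas + Tam group
pooled = (685 + 755, 3092 + 3094)
comb_vs_at = odds_ratio(*counts["Comb"], *pooled)

for label, res in [("Anas vs Comb", anas_vs_comb),
                   ("Tam vs Comb", tam_vs_comb),
                   ("Comb vs Anas+Tam", comb_vs_at)]:
    print(f"{label}: OR = {res[0]:.4f} ({res[1]:.4f} to {res[2]:.4f})")
```

Taking the reciprocal of the last odds ratio gives the AT versus Comb comparison of the final panel.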

12.2.3 Selection design One scenario in the unstructured situation is to conduct a randomised trial with the g interventions but with the objective of choosing that with the highest level of activity. This approach then chooses the observed ‘best’ of the treatments however small the advantage is over the others. The trial size required using this design, developed by Simon, Wittes and Ellenberg (1985), is determined in such a way that if an intervention exists for which the underlying efficacy is superior to the others by a specified amount, then it will be selected with a high probability. For a binary endpoint, when the difference in true response rates of the ‘Best’ and the ‘Next Best’ intervention is δ, then the probability of correctly selecting (CS) the best, PCS, is smallest when there is a single Best and the other g − 1 are of equal but lower efficacy. For planning purposes, the design requires the response rate of the anticipated least effective intervention, denoted πSmall, and the required value of PCS to be specified.

Example 12.2

Selection design – advanced non–small cell lung cancer

Leong, Toh, Lim, et al. (2007) conducted a randomised trial of g = 3 single agents Gemcitabine (G), Vinorelbine (V) and Docetaxel (D) for the treatment of elderly and/or poor performance status patients with advanced non–small-cell lung cancer. The design was implemented with the probability of correctly selecting the best treatment set at 90% (PCS = 0.9). It was anticipated that the single-agent activity of each drug has a baseline response rate of approximately 20% (πSmall = 0.2). In order to detect a δPlan = 0.15 (15%) superiority of the best drug over the others, Table T5 gives a sample size of n = 44 per drug, so that the total number of patients to be recruited is N = gn = 3 × 44 = 132. In the event, 43 G, 45 V, and 46 D patients were randomised to the drugs with corresponding response rates of 16, 20 and 22%, which were all below the set threshold of πSmall + δPlan = 0.2 + 0.15 = 0.35 (35%).


12.2.4 Trial size It is beyond the scope of this text to describe the technical details of the Simon, Wittes and Ellenberg (1985) methodology which are also given by, for example, Machin, Campbell, Tan and Tan (2018). However, Table T5 enumerates sample sizes for a range of situations.
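Although the full methodology is not reproduced here, the probability of correct selection in the least favourable configuration can be evaluated by direct enumeration of binomial outcomes. The sketch below is our own illustration rather than the published algorithm: it assumes one intervention with response rate πSmall + δ, the other g − 1 with rate πSmall, equal group sizes n, and that ties for the highest observed response count are broken at random. The function names are hypothetical.

```python
from math import comb


def binom_pmf(n, p):
    """Binomial probabilities P(X = k) for k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_correct_selection(n, g, pi_small, delta):
    """P(the truly best arm has the highest observed count), ties split at random.

    Assumes one arm with response rate pi_small + delta and g - 1 arms
    with response rate pi_small, each with n patients.
    """
    best = binom_pmf(n, pi_small + delta)
    other = binom_pmf(n, pi_small)
    pcs = 0.0
    below = 0.0  # P(an inferior arm scores strictly fewer than x responses)
    for x in range(n + 1):
        inner = 0.0
        for t in range(g):  # t inferior arms tie with the best at x
            inner += (comb(g - 1, t) * other[x] ** t
                      * below ** (g - 1 - t) / (t + 1))
        pcs += best[x] * inner
        below += other[x]
    return pcs


# Planning values of Example 12.2: g = 3, piSmall = 0.2, delta = 0.15, n = 44
pcs_example = prob_correct_selection(44, 3, 0.2, 0.15)
print(f"PCS with n = 44 per arm: {pcs_example:.3f}")
```

With δ = 0 the function returns 1/g, as it must when all arms are identical, which gives a simple check on the enumeration.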

12.3 Comparisons with placebo (or standard)

12.3.1 Design In certain situations, there may be several potentially active treatments under consideration, each of which it would be desirable to test against a standard (possibly a placebo) to ascertain their effectiveness. The treatments considered may be entirely different formulations (not different doses of the same compound), and one may be merely trying to determine which, if any, are active relative to the current Standard (S) rather than to make a comparison between them. This can be expressed for a binary outcome in terms of testing each null hypothesis, H0: πS − πTi = 0, against each alternative hypothesis HAi: πS − πTi ≠ 0, where i corresponds to each of the g − 1 Test (Ti) treatments. A more practical approach in such cases is to set a common minimum effect size, δImportant, which is deemed to be clinically important by the investigating team for all the comparisons. Any treatment that demonstrates this minimum level would then be considered as ‘sufficiently active’ and perhaps evaluated further in subsequent trials.

Example 12.3 Comparison with a standard – schizophrenia Kahn, Fleischhacker, Boter, et al. (2008) describe a randomised trial comparing four second-generation antipsychotic drugs: Amisulpride (A), Olanzapine (O), Quetiapine (Q) and Ziprasidone (Z), with the first-generation drug Haloperidol (H) in patients with first episode of schizophrenia and schizophreniform disorder. In this example, the comparison is not with a placebo but with the first-generation drug, H. A total of 498 patients were randomised with a 1 : 1 : 1 : 1 : 1 allocation ratio using a minimisation procedure such as we have described in Chapter 5. The endpoint was the time from randomisation to discontinuation of the treatment allocated (the event) and the Kaplan–Meier estimates of the 12-month discontinuation rates were A 40%, O 33%, Q 53%, Z 45%, respectively, compared to H 72%.

12.3.2 Trial size Although for a parallel-group design the conventional approach would be to randomise the Ti treatments and S (g options) equally, perhaps in blocks of size b = g or 2g, Fleiss (1986, pp. 95–96 and 115–116) has shown for a continuous outcome that in this situation it is better to have a larger number of patients receiving S than each of the Ti interventions. This is because every one of the g − 1 comparisons is made against S so that its effect needs to be well established. Accordingly, the S group should have √(g − 1) patients for every one patient of the other options. For example, if g = 5, then √(g − 1) = √(5 − 1) = 2. In such a case, the recommended randomisation ratios are 2 : 1 : 1 : 1 : 1. This can be achieved by randomising in blocks of size b = 6 or 12. However, if g = 6 for example, then √(6 − 1) = 2.24 which is not an integer but with convenient rounding this leads to a randomisation ratio of 2.5 : 1 : 1 : 1 : 1 : 1 or equivalently 5 : 2 : 2 : 2 : 2 : 2. The options can then be randomised in blocks of size b = 15 or 30. In general, the sample size calculation begins by estimating the sample size, NTwo, required for a two-group trial of S versus T, with a clinically important effect size specified, and assuming numbers are to be allocated in the ratio of 1 : φ where

$$ \varphi = \frac{1}{\sqrt{g-1}} \tag{12.2} $$

When the variable being measured is continuous, then the appropriate size can be obtained from a modified Equation (9.5) (here ignoring the small sample correction) in the form:

$$ N_{Two} = \frac{\left(1+\sqrt{g-1}\right)^{2}}{\sqrt{g-1}} \, \frac{\left(z_{1-\alpha/2}+z_{1-\beta}\right)^{2}}{\Delta_{Plan}^{2}}, \quad g \ge 2 \tag{12.3} $$

From which

$$ n_S = N_{Two}\,\frac{\sqrt{g-1}}{1+\sqrt{g-1}} \quad \text{and} \quad n_T = \frac{N_{Two}}{1+\sqrt{g-1}}. $$

The total sample size required is Ng = nS + (g − 1)nT = √(g − 1) NTwo. A similar adjustment is necessary in the cases of the binary, ordered categorical and time-to-event outcomes of Equations (9.6), (9.8) and (9.13), respectively.
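The allocation rule and Equation (12.3) can be sketched numerically as follows. This is a minimal Python illustration for a continuous outcome under the formulas as stated above; the function name fleiss_sizes and the default test size and power are our own assumptions, and the results are left unrounded.

```python
from math import sqrt
from statistics import NormalDist


def fleiss_sizes(g, delta_plan, alpha=0.05, power=0.8):
    """Unrounded sizes for g - 1 test arms each compared with a standard S.

    Returns (N_two, n_S, n_T, N_g) using allocation 1 : 1/sqrt(g - 1)
    and a 2-sided test of size alpha. Illustrative helper only.
    """
    root = sqrt(g - 1)
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_two = (1 + root) ** 2 / root * z_sum**2 / delta_plan**2  # Equation (12.3)
    n_s = n_two * root / (1 + root)   # standard (or placebo) group
    n_t = n_two / (1 + root)          # each test group
    n_total = n_s + (g - 1) * n_t     # equals sqrt(g - 1) * N_two
    return n_two, n_s, n_t, n_total


# Hypothetical planning values: g = 5 options, standardised effect size 0.5
n_two, n_s, n_t, n_total = fleiss_sizes(g=5, delta_plan=0.5)
print(f"N_two = {n_two:.1f}, n_S = {n_s:.1f}, n_T = {n_t:.1f}, N_g = {n_total:.1f}")
```

For g = 5 the standard group is twice the size of each test group, matching the 2 : 1 : 1 : 1 : 1 ratio, and for g = 2 the calculation reduces to the usual equal-allocation two-group total.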

Example 12.4

Comparison with placebo – heroin dependence

Schottenfeld, Charwarski and Mazlan (2008) designed a three-arm trial of buprenorphine (Bu), naltrexone (Na) and placebo (P) to investigate alternative maintenance treatments to reduce problems associated with heroin dependence. On the basis of earlier studies, they anticipated that the Kaplan–Meier estimates of the proportion abstinent at 6 months would be πNa = 0.72 and πBu = 0.42 for Na and Bu, respectively. For P, they considered that πP might range from 0.12 to about 0.24. A conservative planning value would take πP = 0.24. Using Equation (9.10), this leads to the anticipated effect size for Na versus P as HRNa-P = log 0.72/log 0.24 = 0.23, for Bu versus P as HRBu-P = log 0.42/log 0.24 = 0.61 and for Na versus Bu as HRNa-Bu = log 0.72/log 0.42 = 0.38. On this basis, the investigators specified what they termed a ‘medium effect size’ equivalent to HRPlan = 0.5, which is a compromise between 0.23, 0.61 and 0.38 above. This assumption implies that the design comparison used compares πP = 0.24 and πActive = exp[HRPlan × log(πP)] = exp[0.5 × log(0.24)] = 0.49. The calculations begin by assuming a two-group design with allocation ratio, φ = 1. Then with a 2-sided test of size 5% and power 80%, use of Table T3 and Equation (9.13) gives

$$ N_{Two} = 2\left(\frac{1+0.5}{1-0.5}\right)^{2} \frac{(1.96+0.8416)^{2}}{(1-0.24)+(1-0.49)} = 111.24 $$

or approximately 56 patients in each of the two groups. As the investigators had planned, this implies for their g = 3 arm trial, recruiting 56 × 3 = 168 patients. In this case, if randomised equally to the three arms (Na, Bu and P), this could be implemented in blocks of possible size b = 3, 6 or 12. However, the approach of Fleiss (1986) with g = 3 begins the sample size calculation by comparing two groups but now randomised in the ratio of 1 : φ, where φ = 1/√(3 − 1) = 0.7071. In which case, Equation (12.3) leads to nP = 72.58, nNa = nBu = 51.33 and hence N = 72.58 + (2 × 51.33) = 175.24. Practical considerations suggest that randomising in the ratios of 1 : 0.7071 : 0.7071 would be logistically difficult so a ratio of 1 : 0.75 : 0.75, alternatively expressed as 4 : 3 : 3, may be better. The consequences of setting φ = 0.75 are that now nP = 69.27 and nNa = nBu = 51.95 but these are little changed from the previous values. To allow blocks of size 10 (4 + 3 + 3) to be constructed nP is first rounded up to 72 to be divisible by 4, so that nNa = nBu = 3 × (72/4) = 54. The total number of subjects to be recruited then becomes N = 72 + (2 × 54) = 180 to be randomised in 18 blocks of size b = 10. Further, in this example, a 1-sided significance test may be more appropriate as it is likely that only an improvement in abstinence of Bu and Na over P would be of interest. In which case, z1−α/2 = 1.96 is replaced by z1−α = 1.6449 in Equation (9.13) to give nP = 54.56 and nNa = nBu = 40.92 and hence N = 54.56 + (2 × 40.92) = 136.40. With a block size b = 10, this might be rounded to recruit 140 heroin users with P allocated 56 and 42 to each of the Bu and Na options.
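The arithmetic of this example can be retraced with a short Python sketch. Equation (9.13) is not reproduced in this section, so the formula coded below is an assumption: a time-to-event sample size form for allocation ratio 1 : φ chosen because it reproduces the figures quoted in the example. The function name survival_sizes is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def survival_sizes(pi_s, hr, phi=1.0, alpha=0.05, power=0.8, sides=2):
    """Unrounded (N, n_S, n_T) for standard versus one test arm, ratio 1 : phi.

    Assumed form: events E = (1/phi) * ((1 + phi*HR) / (1 - HR))**2 * (z_a + z_b)**2
    and N = E * (1 + phi) / ((1 - pi_S) + phi * (1 - pi_T)).
    """
    pi_t = pi_s**hr  # same as exp(HR * log(pi_S))
    z_alpha = NormalDist().inv_cdf(1 - alpha / sides)
    z_beta = NormalDist().inv_cdf(power)
    events = (1 / phi) * ((1 + phi * hr) / (1 - hr)) ** 2 * (z_alpha + z_beta) ** 2
    n_total = events * (1 + phi) / ((1 - pi_s) + phi * (1 - pi_t))
    n_s = n_total / (1 + phi)
    return n_total, n_s, phi * n_s


# Equal allocation, 2-sided 5%, power 80%
equal = survival_sizes(0.24, 0.5, phi=1.0)
# Fleiss allocation for g = 3, then the rounded ratio 4 : 3 : 3
fleiss = survival_sizes(0.24, 0.5, phi=1 / sqrt(3 - 1))
rounded = survival_sizes(0.24, 0.5, phi=0.75)
# 1-sided test with phi = 0.75
one_sided = survival_sizes(0.24, 0.5, phi=0.75, sides=1)

print(f"N_two (phi = 1): {equal[0]:.2f}")
print(f"phi = 0.7071: nP = {fleiss[1]:.2f}, nNa = nBu = {fleiss[2]:.2f}")
print(f"phi = 0.75:   nP = {rounded[1]:.2f}, nNa = nBu = {rounded[2]:.2f}")
print(f"1-sided:      nP = {one_sided[1]:.2f}, nNa = nBu = {one_sided[2]:.2f}")
```

Small differences in the second decimal place from the values in the text can arise because the sketch uses unrounded normal quantiles and πActive = √0.24 rather than 0.49.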

12.3.3 Analysis The analysis of such a design involves comparing each of the i = 1, 2, …, g − 1, Ti options against the S. For a continuous endpoint variable, the t-test of each comparison takes the form:

$$ t = \frac{\bar{y}_S - \bar{y}_{T_i}}{SE_0\left(\bar{y}_S - \bar{y}_{T_i}\right)}, \tag{12.4} $$

and

$$ SE_0\left(\bar{y}_S - \bar{y}_{T_i}\right) = s_{Pool}\sqrt{\frac{1}{n_S} + \frac{1}{n_T}} = s_{Pool}\sqrt{\frac{1}{\sqrt{g-1}\,n_T} + \frac{1}{n_T}}, \tag{12.5} $$

where the second form applies when the allocation nS = √(g − 1) nT has been used. However, in contrast to Equation (8.8) which concerns only the SDA and SDP, sPool requires an extension of that equation to include SD1, SD2, …, SDg. As a consequence, the corresponding degrees of freedom are df = (nS − 1) + [(g − 1) × (nT − 1)].


Example 12.5 Comparison with a placebo – intellectual disability Tyrer, Oliver-Africano, Ahmed, et al. (2008) give the median and range of quality of life scores of Figure 12.2 following 4 weeks of treatment with either placebo (P), risperidone (R) or haloperidol (H) for aggressive challenging behaviour in patients with intellectual disability. The authors concluded that patients given P showed no evidence at any time points of worse response than did those assigned to either of the antipsychotic drugs. They therefore recommended that: Antipsychotic drugs should no longer be regarded as acceptable routine treatment for aggressive challenging behaviour in people with intellectual disability.

As we do not have access to the data, and indeed the trial design was not of the form suggested by Fleiss (1986), we have produced simulated trial data which is summarised in the lower panel of Figure 12.2. As will therefore be clear, our analysis summarised in Figure 12.3, cannot be taken as any guide to the relative merits of the treatments concerned.

QoL score             Placebo (P)   Risperidone (R)   Haloperidol (H)   Pooled

Actual data of Tyrer, Oliver-Africano, Ahmed, et al. (2008)
Number of patients    29            29                28
Median                72            70                66
Range                 65.7–77.75    60–78             59.5–75.5

Simulated data for the Fleiss (1986) design
Number of patients    42            29                28
Mean                  72.13         70.73             65.95
Standard deviation    2.56          4.01              4.31              3.57

Figure 12.2 Median and range of quality of life (QoL) scores at 4 weeks together with simulated data based on these to mimic a trial using the Fleiss (1986) design. Source: Data from Tyrer, Oliver-Africano, Ahmed, et al. (2008).

The tabulation of Figure 12.3a reproduces that of the simulated data in Figure 12.2. An overall comparison of the differences between the three groups is conducted using analysis of variance (ANOVA) which, in this example, tests the overall null hypothesis H0: μP = μR = μH. Thus, Figure 12.3b uses the command (anova QoL GROUP) for this. It uses the pooled standard deviation obtained from the three groups, given by the square root of the residual MS (termed Root MSE in the computer output), to give sPool = 3.57 and also p-value = 0.0001 suggesting a statistically significant difference between groups. The command structure (regress QoL i.GROUP), given in Figure 12.3c, recognises (GROUP) as a three-level unordered categorical variable.


(a) Simulated data – Tabulation command and Output
table GROUP, contents(n QoL mean QoL sd QoL) row
----------------+-----------------------------
GROUP           |  N(QoL)  Mean(QoL)  SD(QoL)
----------------+-----------------------------
Placebo (P)     |    42      72.13      2.56
Risperidone (R) |    29      70.73      4.01
Haloperidol (H) |    28      65.95      4.31
----------------+-----------------------------

(b) Testing the null hypothesis H0: μP = μR = μH
anova QoL GROUP
Number of obs = 99     Root MSE = 3.5659
-----------+------------------------------------------------
Source     |      SS     df       MS       F    p-value
-----------+------------------------------------------------
GROUP      |   665.44     2   332.72   26.17    0.0001
Residual   |  1220.71    96    12.72 (= 3.57²)
-----------+------------------------------------------------
Total      |  1886.15    98
-----------+------------------------------------------------

(c) Comparing R and H with P
regress QoL i.GROUP
---------------------------------------------------------------
QoL             |   Coef    SE      t     P>|t|   (95% CI)
----------------+----------------------------------------------
Placebo (P)     |  72.13   0.55
Risperidone (R) |  -1.39   0.86   -1.62   0.109   -3.10 to 0.32
Haloperidol (H) |  -6.18   0.87   -7.10   0.001   -7.91 to -4.45
---------------------------------------------------------------

Figure 12.3 Edited Stata commands and output for the analysis of quality of life scores using the simulated data for the lower section of Figure 12.2. Source: Based on Tyrer, Oliver-Africano, Ahmed, et al. (2008).

The regression coefficients of −1.39 and −6.18 correspond to the differences in means between R and P, and H and P. They suggest that QoL may be statistically significantly lower (p-value = 0.001) with H but not with R (p-value = 0.109). The corresponding means are calculated as follows: QoLP = 72.13, QoLR = 72.13 − 1.39 = 70.74 and QoLH = 72.13 − 6.18 = 65.95, which are given in Figure 12.3a and, apart from a small rounding error, are those of the lower panel of Figure 12.2.
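The pooled standard deviation, standard errors and t-statistics of Figure 12.3 can be approximated from the summary statistics alone. The Python sketch below is illustrative only and uses the rounded simulated means and standard deviations of Figure 12.2, so its results agree with the Stata output only to within rounding. The differences are taken as test minus placebo to match the sign of the regression coefficients, which is the reverse of the numerator of Equation (12.4). The function names are our own.

```python
from math import sqrt

# Simulated summary data of Figure 12.2: (n, mean, SD)
groups = {
    "P": (42, 72.13, 2.56),
    "R": (29, 70.73, 4.01),
    "H": (28, 65.95, 4.31),
}


def pooled_sd(data):
    """Pooled SD over all g groups and its degrees of freedom."""
    ss = sum((n - 1) * sd**2 for n, _, sd in data.values())
    df = sum(n - 1 for n, _, _ in data.values())
    return sqrt(ss / df), df


def compare_with_standard(data, standard, test):
    """Difference (test - standard), its SE from the pooled SD, and t."""
    s_pool, df = pooled_sd(data)
    n_s, mean_s, _ = data[standard]
    n_t, mean_t, _ = data[test]
    se = s_pool * sqrt(1 / n_s + 1 / n_t)  # first form of Equation (12.5)
    diff = mean_t - mean_s
    return diff, se, diff / se, df


s_pool, df = pooled_sd(groups)
r_vs_p = compare_with_standard(groups, "P", "R")
h_vs_p = compare_with_standard(groups, "P", "H")

print(f"sPool = {s_pool:.2f} on {df} df")
print(f"R - P: diff = {r_vs_p[0]:.2f}, SE = {r_vs_p[1]:.2f}, t = {r_vs_p[2]:.2f}")
print(f"H - P: diff = {h_vs_p[0]:.2f}, SE = {h_vs_p[1]:.2f}, t = {h_vs_p[2]:.2f}")
```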

12.4 Dose–response designs

12.4.1 Design In some situations, although the test intervention remains intrinsically of the same type, it may be activated at increasing levels of intensity. Thus, in a randomised double-blind trial conducted by Chow, Tai, Tan, et al. (2002) two doses of Tamoxifen (TMX) were tested against placebo (TMX0) in patients with inoperable hepatocellular carcinoma. The reasoning for the two doses (TMX60 and TMX120) was that, although the lower of these was in general use for palliation of symptoms, the higher dose had been suggested as likely to be more beneficial. A similar, dose increasing, design was used by Stevinson, Devaraj, Fountain-Barber, et al. (2003) who compared two doses of homeopathic arnica against placebo to examine any dose–response relation. The report of their trial suggested that, irrespective of dose, there was no advantage to homeopathic arnica over placebo in the prevention of pain and bruising following hand surgery. However, in the analysis, no consideration of the increasing dose levels 0, 6 and 30 of homeopathic arnica appears to have been taken.

Example 12.6

Dose–response – rheumatoid arthritis

Smolen, Beaulieu, Rubbert-Roth, et al. (2008) compared Tocilizumab (T) in two doses of T4 and T8 mg/kg against placebo (T0) to test the therapeutic effect of blocking interleukin in patients with rheumatoid arthritis. They recruited a total of 623 patients randomised equally to the three arms. Their results suggest a dose–response in favour of T with respect to several American College of Rheumatology (ACR) criteria. The primary efficacy endpoint was the proportion of patients with 20% improvement in rheumatoid arthritis signs and symptoms (ACR20), giving response rates of 54/204 (26%) for T0, 101/214 (47%) for T4 and 120/205 (59%) for T8, confirming evidence of an increasing dose–response relation.

12.4.2 Trial size In the simplest design situation, the trial would have g different dose levels, dj, where j = 0, 1, 2, …, g − 1. Then, if the dose–response is linear and the endpoint of interest is a continuous measure, this situation can be summarised by the following (linear regression) model

$$ y_{ij} = \beta_0 + \beta_{Dose}\, d_j + \varepsilon_{ij} \tag{12.6} $$

Here, yij is the outcome measure for patient i receiving dose j, and εij is the corresponding random error term. In this equation, βDose represents the slope of the linear dose–response relation and replaces βTreat of Equation (2.1) as the main focus for the statistical analysis. For expository purposes, we assume the same number of subjects, m, will receive each of the g different doses and the standard deviation, σ, is constant within each dose group. Once the trial is completed, the slope of the fitted linear regression equation is estimated by

$$ b_{Dose} = \frac{\sum_{j=0}^{g-1}\sum_{i=1}^{m}\left(d_j - \bar{d}\right) y_{ij}}{m\sum_{j=0}^{g-1}\left(d_j - \bar{d}\right)^{2}}, \tag{12.7} $$

where $\bar{d} = \sum_{j=0}^{g-1} d_j / g$. The corresponding standard error of bDose is

$$ SE\left(b_{Dose}\right) = \frac{\sigma}{\sqrt{m}}\sqrt{\frac{1}{\sum_{j=0}^{g-1}\left(d_j - \bar{d}\right)^{2}}} \tag{12.8} $$

These can then be used in the fundamental Equation (9.3), adjusted for a 2-sided test, by replacing 2σ² by $\sigma^{2}/\sum_{j=0}^{g-1}(d_j - \bar{d})^{2}$ and δ by βDose. Further, the standardised effect size is now expressed by

$$ \Delta = \frac{\beta_{Dose}\sqrt{\sum_{j=0}^{g-1}\left(d_j - \bar{d}\right)^{2}}}{\sigma} \tag{12.9} $$

This gives the number of subjects randomised at each dose level as:

$$ m = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\Delta^{2}} \tag{12.10} $$

Thus, for specified dose levels, βDosePlan and σPlan, and hence ΔPlan, the total trial size will then require N = gm subjects.


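In the special case where the doses are equally spaced, they can be coded as 0, 1, 2, …, g − 1, so that

$$ SE\left(b_{Dose}\right) = \frac{\sigma}{\sqrt{m}}\sqrt{\frac{12}{g\left(g^{2}-1\right)}} $$

and consequently

$$ N = gm = g \times \frac{12}{g\left(g^{2}-1\right)} \times \frac{\sigma_{Plan}^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\beta_{DosePlan}^{2}} \tag{12.11} $$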

We note that βDosePlan is the planning regression slope which might be anticipated in practice as the difference between the anticipated endpoint measures at the lowest dose (often 0, y0) and the highest dose (yMax), divided by the range, R = dMax − d0, of the doses to be included in the design.

Example 12.7

Rheumatoid arthritis

For the trial of Smolen, Beaulieu, Rubbert-Roth, et al. (2008, Tables 1 and 4) of Example 12.6, which has equally spaced doses, we can deduce that the mean pain levels assessed by VAS (mm) at 24 weeks were approximately 45, 36 and 30 mm for the g = 3 doses consisting of 0, 4 and 8 mg/kg of T. If a repeat trial was planned, then a reasonable value for the regression slope might be the observed change between 0 and 8 mg/kg from this trial or βDosePlan = (30 − 45)/(8 − 0) = −15/8 = −1.875. For the three dose levels used

$$ D_1 = \sum_{j=0}^{2}\left(d_j - \bar{d}\right)^{2} = 32 $$

and from their Table 1, the corresponding SD is approximately σPlan = 22, leading from Equation (12.9) to a standardised effect size ΔPlan = (−1.875 × √32)/22 = −0.48. Using a 2-sided test size of α = 0.05 and power 1 − β = 0.9, Table T2 and Equation (12.10) give

$$ N = 3 \times \frac{(1.96 + 1.2816)^{2}}{(-0.48)^{2}} = 136.82. $$

To be divisible by g = 3, this is rounded to N = 138 allocated with m = 46 per dose level. However, as the doses are equally spaced, Equation (12.11) can be used if we recode the doses 0, 4 and 8 to 0, 1 and 2, respectively, to give

$$ D_2 = \sum_{j=0}^{2}\left(d_j - \bar{d}\right)^{2} = 2. $$

This requires the planned slope to be expressed as βDosePlan = (30 − 45)/(2 − 0) = −7.5. In which case, Equation (12.11) then gives

$$ N = 3 \times \frac{12}{3 \times \left(3^{2} - 1\right)} \times \frac{(1.96 + 1.2816)^{2}}{(-7.5/22)^{2}} = 135.62 $$

or 46 patients per dose. As before, this gives a planned trial size of N = 3 × 46 or 138 patients in total.
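The two routes to the sample size in this example can be checked with a few lines of Python. This is a minimal sketch of Equations (12.9) and (12.10) as given above; the function name dose_response_size is hypothetical, and the standardised effect size is not rounded before use, which is why the result agrees with 135.62 rather than with the 136.82 obtained after rounding ΔPlan to −0.48.

```python
from math import ceil, sqrt
from statistics import NormalDist


def dose_response_size(doses, beta_plan, sigma_plan, alpha=0.05, power=0.9):
    """Unrounded (m, N) for a linear dose-response design with equal m per dose."""
    g = len(doses)
    d_bar = sum(doses) / g
    ssd = sum((d - d_bar) ** 2 for d in doses)    # sum of squared deviations
    delta = beta_plan * sqrt(ssd) / sigma_plan    # Equation (12.9)
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    m = z_sum**2 / delta**2                       # Equation (12.10)
    return m, g * m


# Doses on the original mg/kg scale and recoded as 0, 1, 2
m_raw, n_raw = dose_response_size([0, 4, 8], beta_plan=-15 / 8, sigma_plan=22)
m_coded, n_coded = dose_response_size([0, 1, 2], beta_plan=-15 / 2, sigma_plan=22)

print(f"Original doses: N = {n_raw:.2f}, m rounded up = {ceil(m_raw)}")
print(f"Recoded doses:  N = {n_coded:.2f}, m rounded up = {ceil(m_coded)}")
```

Because recoding the doses rescales the slope and the sum of squared deviations in compensating ways, both calls give the same trial size.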

12.4.3 Analysis As in all circumstances, the form of the analysis will depend on the type of endpoint concerned. For a binary outcome, the principal hypothesis to test will often be one of linearity on the logit scale of the dose–response relation. Alternatively, if a non-linear response is anticipated, then careful thought is required to determine the associated regression model to describe the relationship.

Example 12.8

Dose–response – plaque psoriasis

Papp, Bissonnette, Rosoph, et al. (2008, Table 3) used the number of patients with 75% reduction in psoriasis at week 12 as one endpoint of their trial when investigating the possible dose–response relationship with the use of the calcineurin inhibitor ISA247. Their results are summarised in Figure 12.4.

Dose of ISA247 (mg/kg)      0        0.2      0.3      0.4      All

Number of patients
Randomised                  115      107      113      116      451
Analysed (N)                113      105      111      113      442
With 75% reduction (r)      4        14       26       44       88
Rate (%)                    (3.5)    (13.3)   (23.4)   (38.9)   (19.9)

Figure 12.4 Number of patients with 75% reduction in psoriasis at week 12 by increasing dose of ISA247. Source: Data from Papp, Bissonnette, Rosoph, et al. (2008).

The corresponding logistic regression based on the grouped data from the 442 patients uses the command (blogit r N Dose), and the results of such an analysis are given in Figure 12.5.

Command
blogit r N Dose

Edited output
Logistic regression for grouped data     Number of obs = 442
----------------------------------------------------------------
Outcome  |    Coef       SE       z     P>|z|   (95% CI)
---------+------------------------------------------------------
cons     |  -3.3085   0.3781
Dose     |   7.1309   1.1679    6.11    0.000   4.8419 to 9.4199
----------------------------------------------------------------

Figure 12.5 Edited Stata command and output for the analysis of the proportion of patients achieving a 75% reduction in psoriasis at week 12 of Figure 12.4. Source: Data from Papp, Bissonnette, Rosoph, et al. (2008).


This output results in the estimated model logit(p) = −3.3085 + 7.1309 × Dose which corresponds, on the proportion who respond scale, to pEstimate = exp(−3.3085 + 7.1309Dose)/[1 + exp(−3.3085 + 7.1309Dose)] which is plotted in Figure 12.6 along with the proportions observed from Figure 12.4. In this example, the model describes the data very closely indeed although this will not always be the case.
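The closeness of the fit can be seen by evaluating the fitted model at the four doses and setting the results beside the observed proportions. The following Python sketch does only that back-transformation using the coefficients reported in Figure 12.5; it does not refit the model, and the variable and function names are our own.

```python
from math import exp

# Coefficients reported in Figure 12.5
B0, B_DOSE = -3.3085, 7.1309

# Observed data from Figure 12.4: dose -> (r, N)
observed = {0.0: (4, 113), 0.2: (14, 105), 0.3: (26, 111), 0.4: (44, 113)}


def fitted_proportion(dose):
    """Back-transform the fitted logit to the proportion scale."""
    eta = B0 + B_DOSE * dose
    return exp(eta) / (1 + exp(eta))


for dose, (r, n) in observed.items():
    print(f"dose {dose:.1f} mg/kg: observed {100 * r / n:5.1f}%, "
          f"fitted {100 * fitted_proportion(dose):5.1f}%")
```

The fitted and observed percentages differ by less than one percentage point at every dose.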

12.4.4 Reporting Although it may seem obvious, tabular presentation of the results for a dose–response design should reflect the increasing (or decreasing) order of the doses included in the trial. Thus, although Papp, Bissonnette, Rosoph, et al. (2008) tabulate in columns in the order 0 (placebo), 0.2, 0.3 and 0.4 mg/kg, Stevinson, Devaraj, Fountain-Barber, et al. (2003) choose the column order Arnica 6C, Placebo and Arnica 30C, while Smolen, Beaulieu, Rubbert-Roth, et al. (2008) use 4, 8 and then 0 (placebo) mg/kg. The latter two tabulations make it difficult for a reader to identify patterns along the rows of any of the variables within these tables. The rule for columns in tables should follow that for graphs, where plotting the smallest to largest (here dose) values from left to right on the horizontal scale is regarded as standard practice. Also, this ordering (or the reverse) is necessary if a statistical test for trend is to be conducted. Where possible, present some graphical output to show the dose–response relation such as that in Figure 12.6.

[Plot: percentage of patients with 75% reduction in psoriasis score (vertical axis, 0 to 50) against dose of ISA247 in mg/kg (horizontal axis, 0 to 0.4)]

Figure 12.6 Proportion of patients achieving a 75% reduction in psoriasis at week 12 (with corresponding 95% CI) by given dose of ISA247 (mg/kg). Source: Data from Papp, Bissonnette, Rosoph, et al. (2008).

12.5 Factorial trials

12.5.1 Design In some circumstances, there may be two distinct therapeutic questions that are posed and both questions may be answered within a single trial by use of a factorial design. In a 2 × 2 factorial design, the two intervention types or factors A and B are each studied at two levels. Because each factor has two levels, the usual notation is to denote, for example for factor A, one level (perhaps the lowest) of that factor by (1) and the highest by (a). Similarly, for factor B, the notation is (1) and (b). Combining these, the four options are (1)(1), (a)(1), (1)(b) and (a)(b) which are finally abbreviated further to (1), (a), (b) and (ab). Also, at the analysis stage, when estimating the so-called Main effect of A and of B from the data they are denoted as A and B, respectively. Two-level factorial trials can be concerned with three factors to give a 2 × 2 × 2 or 2³ design, and the structure can be extended to even more factors. Further, two three-level factors may be proposed in a 3 × 3 or 3² design and clearly a mixed design of 3 × 2 is also possible. In all these designs, the lowest level need not be placebo. For example, the two levels may be two types of surgery so that either type may be arbitrarily termed the lowest level.

Example 12.9 Low back pain In a trial conducted by Hancock, Maher, Latimer, et al. (2007) patients with low back pain were given advice, including the suggested use of paracetamol, by their general practitioner. They were then randomised to receive either diclofenac (D) or placebo-diclofenac (P) (the two levels of D are denoted (d) and (1), respectively), and additionally were randomised to either spinal manipulative therapy (M) or placebo manipulative therapy (P) (the two levels of M are denoted (m) and (1), respectively). An important feature of this trial was their use of a placebo-controlled double-blind design. This included placebo-diclofenac (having a placebo to ‘blind’ a drug is not unusual) but also placebo manipulative therapy which must have been more difficult to make ‘blind’. The 2 × 2 possible combinations (1), (d), (m) and (dm) are illustrated in Figure 12.7. The endpoint chosen was patient recovery or not at 12 weeks post-randomisation. The two questions posed simultaneously are the value of D (the main effect D) and the value of M (the main effect M). In addition, this factorial design allows an estimate of the D by M interaction (denoted D.M). An interaction indicates whether or not the effect of D remains the same in the presence and in the absence of M.

Patients with low back pain of less than 6 weeks duration
Random allocation to treatment:
I – Double placebo (1)
II – Diclofenac (d) (with placebo manipulation)
III – Manipulation (m) (with placebo diclofenac)
IV – Diclofenac and Manipulation (dm)

Figure 12.7 Randomised 2 × 2 factorial trial of diclofenac and spinal manipulation as adjunct to advice and paracetamol in patients with low back pain. Source: Based on Hancock, Maher, Latimer, et al. (2007).

Example 12.10

2³ factorial design – cleft lip and palate

Williams, Seagle, Pegoraro-Krook, et al. (2011) conducted a randomised trial using a 2³ factorial design in 376 infants with cleft lip and palate. Consequently, the infants were assigned to one of eight different groups defined by two different lip (L) repairs using the Spina or Millard approaches, two palatal (P) repairs using either the von Langenbeck or Furlow techniques, with surgery performed at two different ages (A) (9–12 and 15–18 months). One aspect of their results suggested better velopharyngeal function for speech with the Furlow palate repair technique.

12.5.2 Randomisation Patients eligible for 2 × 2 factorial trials are randomised to one of the four treatment options in equal numbers. The reason that equal allocation is chosen is that this enables the most statistically efficient analysis to be undertaken. Thus, it is particularly important that the four treatment options are kept approximately balanced throughout the progress of the trial. This is often done using blocks of size b = 4, 8 or 12. The randomisation methods of Chapter 5 extend relatively easily to this more complex design situation. For example, if the design involves the four combinations, labelled for convenience A, B, C and D, then each of these could be allocated the successive digit pairs 0–1, 2–3, 4–5 and 6–7, respectively. If an 8 or 9 occurs in the random sequence chosen, then these are ignored as there is no associated intervention for these digits. Using simple randomisation, the sequence 534 554 25 would ascribe the first eight trial recruits to CBC CCC BC, respectively. In this sequence, the first eight subjects of the allocation would receive: A 0, B 2, C 6 and D 0. This is clearly not a desirable outcome as no subject is allocated to either of the interventions A or D. A more satisfactory method is to first generate all the 24 possible sequences for the order of the four treatments ranging from ABCD to DCBA as in Figure 12.8. Once achieved, then each pair of the chosen randomisation sequence, that is, 53, 45, 54, 25, etc. can be used. To do this, as the first pair 53 in the sequence is not contained in Figure 12.8, it is reduced modulo 24, the number of different sequences. Thus, since 53 = (2 × 24) + 5, the remainder from the division, which is 05, is then used to choose the sequence ADCB. In this way, the random sequence 53 45 54 25 is converted into 05, 21, 06 and 01. Then, as the sequences are numbered from 00 to 23, this generates for the first 16 patients the randomisation sequence: ADCB DBCA BACD ABDC. This clearly produces equal numbers in each group after every 4, 8, 12, 16 patients and so on.

In the report of the Hancock, Maher, Latimer, et al. (2007) trial illustrated in Figure 12.7 concerned with the treatment of low back pain, the authors stated that randomisation was done with randomly permuted blocks of size 4, 8 and 12 for the 240 patients recruited. However, they did not detail how many blocks of the different sizes were utilised although this additional piece of information is really necessary for a satisfactory description of the processes involved. In Section 2.7, it was emphasised that the interventions should be initiated as soon after randomisation as possible, although this may not be possible in all cases.

00  ABCD     06  BACD     12  CABD     18  DABC
01  ABDC     07  BADC     13  CADB     19  DACB
02  ACBD     08  BCAD     14  CBAD     20  DBAC
03  ACDB     09  BCDA     15  CBDA     21  DBCA
04  ADBC     10  BDAC     16  CDAB     22  DCAB
05  ADCB     11  BDCA     17  CDBA     23  DCBA

Figure 12.8 All possible permuted blocks of size 4
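The two allocation schemes described in Section 12.5.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production randomisation system: the function names are hypothetical, the digit strings are those quoted in the text, and the block numbering assumes the alphabetical ordering shown in Figure 12.8.

```python
from itertools import permutations

ARMS = "ABCD"

def simple_allocation(digits):
    """Map digits 0-7 to A-D in pairs (0-1 -> A, ..., 6-7 -> D); skip 8 and 9."""
    return "".join(ARMS[int(d) // 2] for d in digits if d.isdigit() and int(d) < 8)

# The 24 permuted blocks in alphabetical order, indexed 0-23 as in Figure 12.8.
BLOCKS = ["".join(p) for p in permutations(ARMS)]

def block_allocation(digit_pairs):
    """Reduce each two-digit number modulo 24 and look up the matching block."""
    return [BLOCKS[pair % len(BLOCKS)] for pair in digit_pairs]

# Simple randomisation with the digits 534 554 25.
simple = simple_allocation("53455425")
counts = {arm: simple.count(arm) for arm in ARMS}

# Permuted blocks with the digit pairs 53 45 54 25.
blocks = block_allocation([53, 45, 54, 25])

print(simple, counts)     # allocation string and the number of recruits per arm
print(" ".join(blocks))   # four blocks covering the first 16 patients
```

One design point worth noting: reducing the numbers 00 to 99 modulo 24 does not select the blocks with exactly equal probability, because the remainders 00 to 03 each arise from five two-digit numbers and the others from four. A computer-generated scheme would more usually draw the block index directly and uniformly from 0 to 23.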

12.5 FACTORIAL TRIALS

Example 12.11 Cleft palate repair

We referred briefly in Section 2.6 to a trial described by Yeow, Lee, Cheng, et al. (2007) which has a 2 × 2 factorial design in which one factor is the comparison of two alternative forms of surgery for cleft palate, while the other is whether this should be performed at 6 or 12 months of age. In this trial, randomisation was carried out at 6 months of age, so for one group the surgery is immediate, while for the other a 6-month delay is imposed by the design which is posing the question regarding the best timing for surgery. If a child is allocated to the immediate surgery group, there is little opportunity either to withdraw from the trial or to request the other surgical procedure. In contrast, if allocated to surgery at 12 months, then this allows a long period (6 months) for the patient (essentially the caregiver in this case) to consider withdrawal from the trial or to request the other surgical option. Thus, it is likely that there will be different withdrawal and compliance rates amongst these two groups. Indeed, when Yeow, Young, Chen, et al. (2019) reported the results of the trial just described, the majority of the withdrawals occurred amongst those infants randomised to palatal surgery at 12 months.

Williams, Seagle, Pegoraro-Krook, et al. (2011) used a block randomisation structure for their 2³ factorial design, which also included choice of timing of surgery (essentially two infant age groups), such that each sequential set of 8 children assigned to a surgeon was assigned to the eight study groups at random. They give no indication of similar consequences with respect to withdrawal in the later timing group of age 15–18 months. Nevertheless, a better strategy for randomisation in these circumstances might be to randomise first to surgery at 6 or 12 months. Then, if 6 months is allotted, immediately initiate the second randomisation to type of surgery. In contrast, if 12 months is allotted, delay randomising the type of surgery until immediately before it is due at 12 months. This example illustrates that great care should always be taken when selecting a suitable randomisation strategy for the design in question.

12.5.3 Trial size

In a 2 × 2 factorial trial with a continuous outcome measure, there are four means to be estimated, each from n subjects within the respective group. However, when estimating the influence of each factor (the main effect), one is comparing two means each based on 2n observations. These two analyses (since there are two factors) make the assumption that there is no interaction between them, that is, the effect of factor A (say) is the same irrespective of which level of factor B (say) is also given to the patient.


Suppose the 2 × 2 factorial trial compares two factors, D and M, as in the trial of Hancock, Maher, Latimer, et al. (2007) involving patients with low back pain. We then recommend planning trial size in several stages. The first step would be to consider the sample size for factor D. The second would be to consider the sample size for factor M, which may have a specific effect size, test size and power that are different from those for the factor D comparison. Clearly, if the sample sizes are similar for the two factors, then there is no difficulty in choosing the larger as the required sample size. If the sample sizes are very disparate, then a discussion would ensue as to the most important comparison and perhaps a reasonable compromise reached. For example, if the larger trial size is required for the more secondary question, then a mid-way sample size might be favourably considered. In contrast, should the larger trial size be for the primary question, there may be some reluctance to compromise.

Example 12.12 Chronic obstructive pulmonary disease

Calverley, Pauwels, Vestbo, et al. (2003) used a 2 × 2 factorial design to investigate the combination of Salmeterol (S) and Fluticasone (F) (with each drug having a corresponding placebo option) for patients with chronic obstructive pulmonary disease. The four treatment groups were therefore (1), (s), (f) and the combination (sf). The trial was double-blind, and the endpoint was FEV1 assessed 1 year from randomisation.

When planning their trial, they used a planning difference of δPlan = 0.1 L with SDPlan = 0.35 L, to give a standardised effect size of ΔPlan = 0.1/0.35 = 0.29. They anticipated a withdrawal rate of 20%. For a 2-sided test size of 5% and power 90%, Equation (9.5) (but omitting the final term) gives an initial estimate of sample size as N0 = 514.86. This is then rounded upwards to N = 516 so as to be divisible by 4 to enable randomisation to equal numbers per group. Inflating this by 20% to allow for withdrawals increases the suggested total number of subjects to recruit to approximately 620, to give n = 155 in each group.

With the above numbers in each group, the main effect S will eventually be estimated by combining the groups (1) and (f) and comparing that with the combined group (s) and (sf). Thus, the comparison is based on 310 patients in each of these combination groups. However, the investigators appear to have determined sample size by, for example, comparing (s) with (1) to obtain 620 patients and then assigning 310 to (1) and 310 to (s), and from this implying that 310 are also required for (f) and (sf), making a total of 1240 patients. This approach neglects the advantage of their basic 2 × 2 factorial structure and leads to a trial of twice the required sample size.
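The arithmetic of Example 12.12 can be retraced with the short sketch below. Equation (9.5) is not reproduced in this section, so the code assumes the familiar normal-approximation form N = 4(z₁₋α/₂ + z₁₋β)²/Δ² for the total size of a two-group comparison of means; this assumption does reproduce the quoted N0. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def total_sample_size(delta, sd, alpha=0.05, power=0.90):
    """Assumed formula: N = 4 (z_{1-alpha/2} + z_{1-beta})^2 / Delta^2 (total, both groups)."""
    z = NormalDist()
    effect = delta / sd  # standardised effect size, Delta
    return 4 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / effect ** 2

def round_up_to_multiple(x, k):
    """Smallest multiple of k that is not below x."""
    return k * math.ceil(x / k)

n0 = total_sample_size(delta=0.1, sd=0.35)           # initial estimate
n_total = round_up_to_multiple(n0, 4)                # divisible by the 4 groups
n_recruit = round_up_to_multiple(n_total * 1.20, 4)  # inflated by 20% for withdrawals
per_group = n_recruit // 4
per_main_effect_arm = 2 * per_group                  # e.g. (1) + (f) versus (s) + (sf)

print(f"N0 = {n0:.2f}, N = {n_total}, recruit = {n_recruit}, "
      f"per group = {per_group}, per main-effect arm = {per_main_effect_arm}")
```

The final quantity makes the factorial saving explicit: each main effect is a comparison of two combined arms, so the total calculated for one two-group comparison already serves both factors.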


12.5.4 Analysis

12.5.4.1 Comparing means

The analysis for a 2 × 2 factorial design involves essentially three stages. The first two involve the calculation of the two main effects (say) A and B. If the endpoint is continuous, the main effect A is obtained by calculating the difference between the mean of all the patients treated at one level of factor A, that is (1) and (b), and the mean of those who receive the other level of factor A, that is (a) and (ab). A statistical test of the null hypothesis may also be conducted to obtain the corresponding p-value and (often more important) the 95% CI for that difference. The same process is repeated for factor B to obtain the main effect B.

The third stage is to determine the interaction. This involves first combining the data from the group that receives the first levels of A and B, that is (1), with the group receiving the second level of both A and B, that is (ab), to give the combination (1) + (ab). This group is then compared with the combined group (a) + (b). Just as for the main effects, the difference between the means of these two combined groups is calculated to obtain the interaction effect A.B. A statistical test of the null hypothesis can then be conducted, the p-value obtained and the 95% CI for that interaction calculated.

In brief terms, although some judgement is required, if A.B is not statistically significant, then it can be ignored when interpreting the results of the trial in terms of the effects of A and B. On the other hand, if the interaction is statistically significant, then this implies that even if (say) A is also statistically significant, the actual magnitude of that difference will differ depending on the associated level of factor B. For example, the difference may be smaller with the first level of B and larger with the second level of B.

Example 12.13 Chronic obstructive pulmonary disease

The reported results from the 2 × 2 factorial design trial of Calverley, Pauwels, Vestbo, et al. (2003) give the mean values for (1), (s), (f) and (sf) as 1264, 1323, 1302 and 1396 mL, respectively, based on information from approximately n = 360 patients per group. From the reported confidence intervals, it can be deduced that SD ≈ 265 mL so that SE(Mean) = SD/√n = 265/√360 = 13.97. From this information, the main effect S can be estimated by subtracting from the mean value of those combinations of which S is a member, that is (s) + (sf), the mean of those for which S is not a member, that is (1) + (f). Hence,

$$S = \frac{1323 + 1396}{2} - \frac{1264 + 1302}{2} = 1359.5 - 1283.0 = 76.5 \text{ mL}.$$

Each of the two terms in this calculation is a mean of two means. Thus, both 1359.5 and 1283.0 have the same

$$SE = \sqrt{\frac{1}{2^2}\left(\frac{265^2}{360} + \frac{265^2}{360}\right)}.$$

As a result, since S is the difference between these two means of means,

$$SE(S) = \sqrt{\frac{1}{2^2}\left(\frac{265^2}{360} + \frac{265^2}{360}\right) + \frac{1}{2^2}\left(\frac{265^2}{360} + \frac{265^2}{360}\right)} = \frac{265}{\sqrt{360}} = 13.97.$$

The test for the main effect of salmeterol is then z = S/SE(S) = 76.5/13.97 = 5.48. This is very large and has a very small p-value < 0.00001. Similarly, the main effect of fluticasone is

$$F = \frac{1302 + 1396}{2} - \frac{1264 + 1323}{2} = 55.5,$$

with the same standard error SE(F) = 13.97, so that z = F/SE(F) = 55.5/13.97 = 3.97, which from Table T2 has a p-value = 2(1 − 0.99996) = 0.00008. Thus, the main effects S and F are both statistically significant, but that of salmeterol would appear to be clinically the more important.

To test if there is an interaction, it is necessary to calculate

$$S.F = \frac{1264 + 1396}{2} - \frac{1323 + 1302}{2} = 17.5,$$

which again has the same standard error SE(S.F) = 13.97. Thus, the z-test for the interaction is z = S.F/SE(S.F) = 17.5/13.97 = 1.25, which from Table T2 has a p-value = 2(1 − 0.89435) = 0.21. This is not statistically significant at the 5% level, and the magnitude of the interaction is small relative to the main effects and is therefore unlikely to be of any clinical consequence.
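The three contrasts of Example 12.13 can be computed together, as in the following sketch. It takes the four cell means, SD ≈ 265 mL and n ≈ 360 per group from the example, and assumes equal group sizes and a Normal reference distribution; the dictionary keys and the helper name are illustrative only.

```python
import math
from statistics import NormalDist

means = {"1": 1264, "s": 1323, "f": 1302, "sf": 1396}  # cell means in mL
sd, n = 265, 360                                       # approximate SD and patients per cell

def contrast(plus, minus):
    """Mean of the two `plus` cells minus the mean of the two `minus` cells."""
    return sum(means[c] for c in plus) / 2 - sum(means[c] for c in minus) / 2

# Every contrast weights each of the four cell means by 1/2, so all share one SE.
se = math.sqrt(4 * (1 / 2) ** 2 * sd ** 2 / n)  # simplifies to sd / sqrt(n)

effects = {
    "S": contrast(("s", "sf"), ("1", "f")),
    "F": contrast(("f", "sf"), ("1", "s")),
    "S.F": contrast(("1", "sf"), ("s", "f")),
}

results = {}
for name, estimate in effects.items():
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    results[name] = (estimate, z, p)
    print(f"{name}: estimate = {estimate:.1f} mL, SE = {se:.2f}, z = {z:.2f}, p = {p:.5f}")
```

Because the p-values here are computed from unrounded z-values, they may differ in the final decimal place from those read from Table T2.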

12.5.4.2 Modelling approach

As we noted immediately prior to Example 12.9, the usual notation when describing a 2 × 2 factorial design is to indicate one level of (say) factor A by (1) and the other level by (a). Similarly, factor B is represented by (1) and (b). Combining these, we have the four binary pairs (1)(1), (a)(1), (1)(b) and (a)(b). For randomisation purposes we label these as (1), (a), (b) and (ab), but for modelling purposes we regard these as (0, 0), (1, 0), (0, 1) and (1, 1). Thus, for each patient included in the trial of Example 12.13, their allocated randomised option can be represented as a binary pair (xS, xF). The analysis of a factorial design can then be conducted using the regression model approach which, for the situation we have just described, may be written (omitting the random error term) as

$$y = \beta_0 + \beta_S x_S + \beta_F x_F, \qquad (12.12)$$

where y is FEV1, while β0, βS and βF are the corresponding regression coefficients. In this case, the estimates bS and bF estimate the main effects of S and F, respectively. To test for the presence of an interaction, the above model is extended to

$$y = \beta_0 + \beta_S x_S + \beta_F x_F + \beta_{SF} x_S x_F. \qquad (12.13)$$
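To connect Equations (12.12) and (12.13) with the cell means, the sketch below computes the least-squares coefficients directly from the four means of Example 12.13. It assumes equal numbers of patients per cell, in which case the estimates have the simple closed forms shown; the variable names are illustrative. With 0/1 coding, the coefficient of the product term is twice the interaction contrast S.F of Section 12.5.4.1, but its standard error is also doubled, so the z-statistic is unchanged.

```python
import math

m_1, m_s, m_f, m_sf = 1264, 1323, 1302, 1396  # cell means (mL) from Example 12.13
sd, n = 265, 360                              # approximate SD and patients per cell

# Model (12.12), no interaction: with a balanced design the least-squares
# slopes equal the main effects obtained by comparing means.
b_s = (m_s + m_sf) / 2 - (m_1 + m_f) / 2
b_f = (m_f + m_sf) / 2 - (m_1 + m_s) / 2
b_0 = (m_1 + m_s + m_f + m_sf) / 4 - (b_s + b_f) / 2

# Model (12.13), with interaction: four coefficients fit the four cell means exactly.
g_0 = m_1                      # mean when x_S = x_F = 0
g_s = m_s - m_1                # effect of S when F is absent (not the main effect)
g_f = m_f - m_1                # effect of F when S is absent
g_sf = m_sf - m_s - m_f + m_1  # coefficient of the product term x_S * x_F

se_g_sf = math.sqrt(4 * sd ** 2 / n)  # variance is the sum over four independent cell means
z_sf = g_sf / se_g_sf

print(f"(12.12): b0 = {b_0}, bS = {b_s}, bF = {b_f}")
print(f"(12.13): g0 = {g_0}, gS = {g_s}, gF = {g_f}, gSF = {g_sf}, z = {z_sf:.2f}")
```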

In Equation (12.13), the extra variable is merely the product of xS and xF, and the interaction is assessed by testing the null hypothesis that βSF = 0. Once again, the advantage of the regression approach is that further terms can be added to Equations (12.12) or (12.13) to allow for other variables that influence the outcome; in the context of a clinical trial, however, any added covariates (no more than two) must be known to be prognostic and identified as such in the trial protocol.

12.5.5 Practical issues

The factorial design may be particularly useful in circumstances where (say) factor A addresses a major therapeutic question, while factor B poses a secondary one. For example, A might be the addition of a further drug to an established combination chemotherapy for a specific cancer, while B may be the comparison of anti-emetics delivered with the drugs. However, the concern over the estimation of any interaction between the two factors remains, although its very presence could not be detected if the two questions are not posed simultaneously. The advantages of the factorial design in enabling interactions to be estimated, and the practical difficulties with respect to statistical bias if the numbers of patients recruited in each cell of the design are unequal, have been highlighted by Lu, Lee, Young, et al. (2020).

As we emphasised when describing Figure 1.2, in some cases the best experimental design may not be a practical option for the trial. For example, in the context of a planned 2 × 2 factorial trial of (say) two drugs A and B, each against a placebo, there are four combinations (1), a, b and ab. With these combinations, the intention is that one in four of the patients receives both placebos, and therefore has no chance of activity, although in an adjuvant treatment setting this may turn out to be the best option were it to be tested. Equally, patients may receive both A and B, perhaps with a feared unacceptably high toxicity. These considerations may reduce the optimal four-group parallel design to a practical three-group design of either an [a, b, ab] or a [(1), a, b] configuration, depending on the circumstances. Both these designs are statistically less efficient than the full factorial, and so may require more patients than the full design to answer the less complete range of questions. The design of Example 12.1, comparing two drugs and their combination, has the format [a, t, at] and so too may be regarded as partially structured, rather than unstructured, depending on the alternative hypotheses of interest.

Example 12.14 Malignant pleural mesothelioma

The trial conducted by Muers, Stephens, Fisher, et al. (2008) investigated the role of adjuvant chemotherapy using a combination of mitomycin, vinblastine and cisplatin (MVP) and single-agent vinorelbine (Vi) in patients with malignant pleural mesothelioma. All patients received active symptom control (ASC) for their disease. The patients were randomised on a 1 : 1 : 1 basis to receive ASC alone (1), ASC with Vi (vi), or ASC with MVP (mvp). The potential fourth arm, ASC with the combination of (mvp) and (vi), which would complete a 2 × 2 factorial design, was presumably not considered appropriate in this situation. In such a case, the design is equivalent to one in which several interventions, here two, are compared with a standard as discussed in Section 12.3.

12.5.6 Reporting

Although no new principles for reporting are raised with factorial designs, it is nevertheless important to describe very carefully the steps in arriving at the final sample size chosen. Care is also needed in describing the randomisation processes, particularly the block size and also whether or not both factors were randomised at the same point in time. The structure of a 2 × 2 design necessitates giving the results (at least for the primary endpoint) in a tabular format which details the number of subjects and the associated summary statistic appropriate to the endpoint of concern with, as appropriate, an indication of the variability for each of the four cells. Further to these, the magnitude of the two main effects and their 95% CIs should be indicated. Also included should be the corresponding details associated with the possible interaction between the two factors concerned.

Example 12.15 Unilateral cleft lip and palate

Figure 12.9 illustrates a possible tabular representation of the proportion of infants having adequate velopharyngeal (VP) function following randomisation to two types of palatal surgery and two timings in the trial conducted by Williams, Seagle, Pegoraro-Krook, et al. (2011). The proportions with adequate VP are presented in the main body of the table, while the estimated difference in proportions of the two factors is presented in the right and lower margins. Finally, information with respect to the presence of an interaction between the factors is given in the bottom right hand corner.

                                    Palate surgery
Timing of surgery (m)    Furlow           Langenbeck       Sub total         Main effect of timing
09–12                    69/78 (88%)      74/103 (72%)     143/181 (79%)     Difference  6.2%
15–18                    72/95 (76%)      70/100 (70%)     142/195 (73%)     95% CI      −2.4 to +14.8%
Sub total                141/173 (82%)    144/203 (71%)    285/376 (76%)     p-value     0.16

Main effect of type of surgery            Interaction
Difference  10.6%                         Interaction  4.4%
95% CI      +2.1 to +19.1%                95% CI       −4.3 to +13.0%
p-value     0.017                         p-value      0.33

Figure 12.9 Outcome for the proportion of infants with Unilateral Cleft Lip and Palate with normal resonance or adequate velopharyngeal function in the respective palate surgery by age groups. Source: Data from Williams, Seagle, Pegoraro-Krook, et al. (2011).
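The marginal entries of Figure 12.9 can be recovered from the four cell counts, as sketched below. The method is an assumption rather than a statement of how the original authors computed their figures: differences in proportions are given a large-sample (Wald) 95% CI, and the p-value comes from a z-test using the pooled proportion under the null hypothesis. Function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

# (infants with adequate VP function, infants assessed) for each cell of Figure 12.9
cells = {
    ("09-12", "Furlow"): (69, 78),
    ("09-12", "Langenbeck"): (74, 103),
    ("15-18", "Furlow"): (72, 95),
    ("15-18", "Langenbeck"): (70, 100),
}

def pool(*keys):
    """Add up responders and totals over the named cells."""
    return sum(cells[k][0] for k in keys), sum(cells[k][1] for k in keys)

def compare(group1, group2):
    """Difference in proportions with a Wald 95% CI and a pooled z-test p-value."""
    (r1, n1), (r2, n2) = group1, group2
    p1, p2 = r1 / n1, r2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # used for the CI
    pooled = (r1 + r2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # used for the test
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se0)))
    return diff, diff - 1.96 * se, diff + 1.96 * se, p_value

timing = compare(pool(("09-12", "Furlow"), ("09-12", "Langenbeck")),
                 pool(("15-18", "Furlow"), ("15-18", "Langenbeck")))
surgery = compare(pool(("09-12", "Furlow"), ("15-18", "Furlow")),
                  pool(("09-12", "Langenbeck"), ("15-18", "Langenbeck")))
# Interaction: the two "diagonal" cells combined versus the two "off-diagonal" cells.
interaction = compare(pool(("09-12", "Furlow"), ("15-18", "Langenbeck")),
                      pool(("09-12", "Langenbeck"), ("15-18", "Furlow")))

for label, (diff, low, high, p) in [("Timing", timing), ("Surgery", surgery),
                                    ("Interaction", interaction)]:
    print(f"{label}: {100 * diff:.1f}% (95% CI {100 * low:.1f} to {100 * high:.1f}%), p = {p:.3f}")
```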

12.6 Complex structure comparisons

12.6.1 Design

As the term 'complex' suggests, there are clearly many trial designs which might be covered. Consequently, we choose just one design in which, although three drugs are to be compared, two of the drugs are different but belong to the same class of compounds, while the third drug is from a second class of compounds which has a different mode of action.

12.6.2 Trial size

Example 12.16 Newly diagnosed type 2 diabetes

In the trial of Example 1.3, Weng, Li, Xu, et al. (2008) were interested to determine whether the disease-modifying effect in newly diagnosed patients treated for type 2 diabetes was due to the insulin therapy itself or due to the effects of simply eliminating glucotoxicity by achieving excellent glycaemic control. Further, if the latter was the case, then they wished to determine which of two early intensive therapies would be more beneficial. The trial compares two short-term insulin therapies, multiple daily insulin injections (MDI) and continuous subcutaneous insulin infusion (CSII), both of which target overall glycaemic control, with an oral hypoglycaemic agent (OHA).

The endpoint measure is a binary variable as the objective of the trial is to estimate the respective treatment remission proportions at 1 year, πOHA, πMDI and πCSII. However, the authors identified two specific comparisons they wished to make, which imply testing the following two null hypotheses: H01: πOHA = (πMDI + πCSII)/2 and H02: πMDI = πCSII. Thus, in planning such a trial, the investigators would need to specify two planning effect sizes, δPlan1 and δPlan2.

The authors anticipated that the proportion in long-term remission with OHA would be 25%, while in those who received insulin treatment (CSII or MDI) it would be 45%. Thus, for testing H01, with 2-sided significance level of 5% and power 80%, use of Equation (9.6) with allocation ratio φ = 1 implies approximately 90 patients to receive OHA, and consequently 45 to receive CSII and 45 MDI.

An alternative strategy may be to use a 1 : 2 ratio (φ = 2) as the investigators have another hypothesis under consideration. This scenario suggests 70 patients for OHA, and the same number to each of CSII and MDI (compared to 45 each earlier). The scenario involves larger patient numbers in total but, although retaining the 80% power for testing H01, actually reduces the number required in the OHA group from 90 to 70. However, 70 patients in each of MDI and CSII are insufficient to test H02: πMDI = πCSII unless the anticipated difference between CSII and MDI is greater than 20%.

The authors worked on the basis of 90 patients for each of the three groups, and this is sufficient to test for a 20% difference between CSII and MDI. However, now that 90 are also to be assigned to OHA, this implies a power increase from 80 to 90% for testing H01. An increase in power is often a good thing as it brings an increase in trial size, which then enables a more precise estimate of the treatment differences to be calculated.

In simplistic terms, designing on the basis of H01 might provide insufficient patient numbers to test H02. However, designing on H02 is likely to yield more than sufficient patient numbers for H01; consequently, the power will be increased. Clearly, the investigating team has to decide on the relative priorities of the two questions posed. In general, it is this kind of dilemma that makes such trials difficult to design, as some compromise with respect to the final sample size will have to be achieved.
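The effect of the allocation ratio on the group sizes for testing H01 can be explored with the sketch below. Equation (9.6) is not reproduced in this section, so the code assumes a commonly used normal-approximation formula for comparing two proportions with allocation 1 : φ; it should be read as an approximation that gives values in the region of the rounded figures quoted in Example 12.16, not as the book's own formula. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_sizes(pi1, pi2, phi=1.0, alpha=0.05, power=0.80):
    """Sizes (n1, n2 = phi * n1) for a 2-sided test of pi1 = pi2 (assumed formula).

    The null variance uses the allocation-weighted average proportion and the
    alternative variance uses the two planning proportions separately.
    """
    z = NormalDist()
    z_alpha, z_beta = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    pi_bar = (pi1 + phi * pi2) / (1 + phi)
    numerator = (z_alpha * math.sqrt((1 + phi) * pi_bar * (1 - pi_bar))
                 + z_beta * math.sqrt(phi * pi1 * (1 - pi1) + pi2 * (1 - pi2))) ** 2
    n1 = numerator / (phi * (pi2 - pi1) ** 2)
    return n1, phi * n1

# H01: OHA (25% remission) versus insulin, CSII or MDI combined (45% remission).
oha_equal, insulin_equal = two_proportion_sizes(0.25, 0.45, phi=1)
oha_ratio, insulin_ratio = two_proportion_sizes(0.25, 0.45, phi=2)

print(f"phi = 1: OHA {oha_equal:.1f}, insulin {insulin_equal:.1f} "
      f"({insulin_equal / 2:.1f} each for CSII and MDI)")
print(f"phi = 2: OHA {oha_ratio:.1f}, insulin {insulin_ratio:.1f} "
      f"({insulin_ratio / 2:.1f} each for CSII and MDI)")
```

Under these assumptions the unrounded OHA group sizes fall a little below the quoted 90 and 70, which are described in the text as approximate.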

12.6.3 Analysis

In Example 12.16, the analysis for testing of the two null hypotheses amounts to conducting separate two-sample tests, the form of which will depend on the type of endpoint variable under consideration. These tests are not independent of each other as they each use, at least in part, the same patient observations.

Example 12.17 Newly diagnosed type 2 diabetes

The remission rates at 1 year reported by Weng, Li, Xu, et al. (2008) were 51.1% (68/133) with CSII, 44.9% (53/118) with MDI, and 26.7% (27/101) with OHA. The two-part analysis is illustrated in Figure 12.10 using the Stata command (prtest) for comparing two proportions. From the analysis of Figure 12.10a, one might conclude that there is a real difference in remission rates between the treatment types used (OHA versus CSII or MDI), which is estimated as 21.5% (95% CI 11 to 32%), p-value = 0.0002. However, the difference between the use of CSII and MDI of Figure 12.10b is not proven, with an estimated difference of only 6.2%, a 95% CI of −6.1 to 18.6% which includes the null hypothesis difference of zero, and a p-value = 0.33.

(a)
tabulate Type Remission
prtest Remission, by (Type)

Output – First hypothesis
-------------+-----------------------+-------
             |       Remission       |
Type         |   No     Yes    (%)   | Total
-------------+-----------------------+-------
CSII or MDI  |  130     121  (48.2)  |   251
OHA          |   74      27  (26.7)  |   101
-------------+-----------------------+-------
Total        |   Difference  (21.5)  |   352
-------------+-----------------------+-------

Two-sample test of proportions
-----------------------------------------------------------------
Group        |   Mean      SE       z    P>|z|       (95% CI)
-------------+---------------------------------------------------
CSII or MDI  |  0.4821
OHA          |  0.2673
-------------+---------------------------------------------------
Diff         |  0.2147  0.0542   3.69   0.0002  0.1086 to 0.3209
-----------------------------------------------------------------

(b)
tabulate Treat Remission
prtest Remission, by (Treat)

Output – Second hypothesis
---------+-----------------------+-------
         |       Remission       |
Treat    |   No     Yes    (%)   | Total
---------+-----------------------+-------
CSII     |   65      68  (51.1)  |   133
MDI      |   65      53  (44.9)  |   118
---------+-----------------------+-------
Total    |   Difference   (6.2)  |
---------+-----------------------+-------

Two-sample test of proportions
-------------------------------------------------------------
Group    |   Mean      SE       z    P>|z|       (95% CI)
---------+---------------------------------------------------
CSII     |  0.5113
MDI      |  0.4492
---------+---------------------------------------------------
Diff     |  0.0621  0.0631   0.98    0.33  -0.0614 to 0.1857
-------------------------------------------------------------

Figure 12.10 Edited Stata commands and output for analysing a parallel-group trial comparing three treatments for diabetes mellitus. Source: Data from Weng, Li, Xu, et al. (2008).

As we noted in Chapter 8, in circumstances where there is multiple repeat statistical testing within the same data set, this can cause the single significance level set at the design stage of the trial to be no longer applicable. The way in which the significance level changes depends in a complex way on how many, and which, comparisons are to be made, and usually this cannot be readily quantified. One method used is to apply the Bonferroni correction to each of the p-values obtained. This simply multiplies each of these by the number of statistical tests undertaken. Thus, from the analysis of Figure 12.10, the two p-values are 0.0002 and 0.33, which are then modified to become 0.0004 and 0.66, respectively. In this example, these changes have little influence on the interpretation.
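As a cross-check on Figure 12.10 and the Bonferroni adjustment, the following sketch recomputes the two p-values from the reported counts and then multiplies each by the number of tests. It assumes the large-sample z-test with the pooled proportion under the null hypothesis, which appears to match the z-values in the figure; the function names are hypothetical, and capping adjusted values at 1 is a conventional addition not mentioned in the text.

```python
import math
from statistics import NormalDist

def two_sample_proportion_p(r1, n1, r2, n2):
    """Two-sided p-value for H0: equal proportions, using the pooled proportion."""
    pooled = (r1 + r2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (r1 / n1 - r2 / n2) / se0
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    k = len(p_values)
    return [min(1.0, k * p) for p in p_values]

p_h01 = two_sample_proportion_p(121, 251, 27, 101)  # CSII or MDI versus OHA
p_h02 = two_sample_proportion_p(68, 133, 53, 118)   # CSII versus MDI
adjusted = bonferroni([p_h01, p_h02])

print(f"H01: p = {p_h01:.4f}, Bonferroni-adjusted = {adjusted[0]:.4f}")
print(f"H02: p = {p_h02:.2f}, Bonferroni-adjusted = {adjusted[1]:.2f}")
```

Because the adjustment here is applied to unrounded p-values, the adjusted values may differ slightly in the last decimal place from those obtained by doubling the rounded 0.0002 and 0.33.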

CHAPTER 13

Paired and Matched Designs

This chapter introduces designs in which, in the case of two interventions, each participant in the trial receives both. In some situations, the alternative interventions may be given simultaneously, for example, one intervention in one eye and the other in the other eye. The comparison in this matched-pair design is then made by utilising the difference in outcome between the two eyes. In other situations, the two interventions are given one after the other over two periods of time in one of two possible sequences in what is termed a cross-over trial. These two sequences are then randomised so that, in general, half the participants are allocated one sequence and half the other. The split-mouth design is of particular relevance to dental studies. Using this design, one part of the mouth, for example, receives one of the interventions while another part receives the comparator intervention. The interventions concerned are randomised to each part. Advantages and limitations of the use of these designs are outlined. Appropriate methods for analysis and determination of sample size are included.

13.1 Matched-pair trials

13.1.1 Design

In general, a matched-pair design enables the subjects concerned to receive both of the two interventions under test. This pairing allows a within-subject comparison of the alternative interventions and hence has the potential to estimate differences between them more efficiently.

Example 13.1 Anaesthesia for deformed eyelid surgery

The randomised trial of Pool, Struys and van der Lei (2015) provides a clear example where a matched-pair design is very appropriate. In that trial, patients who required upper blepharoplasty to repair both deformed eyelids were randomised, prior to surgery, to receive one form of anaesthesia to one eyelid and a comparator form to the other. Thus, anaesthesia using Lidocaine with Epinephrine (LE) was given to one eyelid and Prilocaine with Felypressin (PF) to the other. The trial randomised equal numbers of patients to receive either LE or PF in the right eyelid, with the left eyelid automatically allocated the other anaesthesia option. Following each anaesthetic procedure, the patient was asked to score the pain experienced on infiltration using a visual analogue scale (VAS) ranging from 0 (no pain at all) to 10 (unbearable pain). Thus, each patient provides a matched pair of observations. The outcome measure for assessing the relative merits of the two procedures was the difference between the two VAS pain scores recorded.

Although the outcome of concern comprises a single difference, it is useful to examine the individual profiles contributing to this difference from each patient. These are shown in Figure 13.1, separated into the two groups depending on the anaesthesia given to the right eyelid. The figure shows many different profiles, some showing lower VAS scores with PF, others lower scores with LE, while three patients have the same score from both eyelids. Also, the patterns in Figure 13.1a and b differ somewhat. The former indicates a worsening, hence higher VAS, with LE in 12 patients, improvement in 7, and no change in one. In the latter, all but two patients experienced a lower VAS with LE. This observation suggests there may be some influence of which anaesthetic is given first (always to the right eye in this trial design).

[Figure 13.1 comprises two spike-plot panels: (a) PF right, LE left eyelid (sequence PF-LE) and (b) LE right, PF left eyelid (sequence LE-PF). In each panel the vertical axis is the VAS score (0 to 10), and the horizontal axis identifies the anaesthetic, Prilocaine with Felypressin or Lidocaine with Epinephrine, and the eyelid, right or left, to which it was given.]

Figure 13.1 Spike plots of the VAS profiles of individual patients receiving anaesthesia with (a) 20 patients given PF to the right eyelid and (b) 20 given LE to the right eyelid. Source: Data from Pool, Struys and van der Lei (2015).


In trials with a matched-pair format, equal randomisation is strongly advised as in the example of the eyelid surgery trial referred to above, which was organised so that half the patients were assigned (Right eyelid – LE; Left eyelid – PF) and half (Right eyelid – PF; Left eyelid – LE).

Example 13.2 Sputum samples from patients with cystic fibrosis

In the trial conducted by Equi, Balfour-Lynn, Bush and Rosenthal (2002), sputum specimens were obtained and cultured from patients with cystic fibrosis after receiving placebo (P) and also after receiving azithromycin (A). A total of 41 patients were recruited: 20 received A followed by P, and 21 received P followed by A. Each culture was then assessed as to whether or not it grew P. aeruginosa, leading to a binary Negative (0) or Positive (1) response as in Figure 13.2. Equivalently, the possible pairs of response for each subject are (0, 0), (1, 0), (0, 1) and (1, 1).

Grew P. aeruginosa culture       Grew P. aeruginosa culture with Azithromycin
with Placebo                     Negative      Positive      Total
Negative                         17            6             23
Positive                         7             11            18
Total                            24            17            41

Figure 13.2 Results from a matched-pair design comparing the numbers of patients with cystic fibrosis whose cultures grew P. aeruginosa after receiving Placebo (P) and after receiving Azithromycin (A). Source: Data from Equi, Balfour-Lynn, Bush, et al. (2002).

13.1.2 Analysis

13.1.2.1 Difference in means

For a continuous endpoint measure, each of the N = 2m subjects will have individual observations $y_{Si}$ and $y_{Ti}$, respectively, from interventions S and T. From these, the within-patient difference is $d_i = y_{Ti} - y_{Si}$. The mean difference is

$$\bar{d} = \sum_{i=1}^{N} d_i / 2m = \bar{y}_T - \bar{y}_S,$$

the standard deviation is

$$SD(d) = \sqrt{\sum_{i=1}^{N} (d_i - \bar{d})^2 / (N - 1)},$$

with degrees of freedom df = N − 1, and the standard error is $SE(\bar{d}) = SD(d)/\sqrt{N}$. The null hypothesis postulates that the true difference between the interventions, δ, is zero, and the corresponding paired t-test for testing this null hypothesis is

$$t = \frac{\bar{d}}{SE(\bar{d})}. \qquad (13.1)$$

Although the regression command of Figure 13.3 provides no new insight, it has the potential to adjust the comparison between the treatments in the matched-pair trial to allow for patient characteristics which are thought to be of prognostic importance for the outcome.

Example 13.3 Anaesthesia for deformed eyelid surgery

The trial of Pool, Struys and van der Lei (2015), described in Example 13.1, randomised 40 patients, of whom 20 received LE to the right eyelid, which was operated on first, and hence PF to the left eyelid, which was operated on second. The other 20 patients had the reverse order for their eyelid anaesthesia and surgery. The data arising from this trial are given in Table 1 of the associated publication, from which the mean VAS score is 5.175 units with PF anaesthesia, calculated by combining the data from the 20 patients who received PF in the right eyelid together with the other 20 patients who received PF in the left eyelid. Similarly, for LE the relevant data from right and left eyelids give a mean VAS of 4.475 units. The corresponding SDs are 2.2290 and 2.0877 units, respectively.

The mean of the N = 2m = 40 paired differences (PF–LE) is $\bar{d}$ = 0.70, indicating somewhat greater pain with PF anaesthesia. The SD(d) = 2.1147, while the SE($\bar{d}$) = 2.1147/√40 = 0.3344. Using Equation (13.1) with these values gives t = 0.70/0.3344 = 2.09. In Table T4 of the Student's t-distribution, this is a value between 2.023 and 2.125, in the row for df = 39 and the columns for α = 0.05 and 0.04, respectively. This suggests a p-value ≈ 0.045. Alternatively, as the df are relatively large, Table T2 of the Normal distribution can be used to give the p-value = 2(1 − 0.98169) = 0.037.

The associated analysis is summarised in Figure 13.3a using the command (ttest) followed by the hypothesis that is to be tested (Diff==0). The results shown closely mirror the previous calculations, although the p-value is calculated more accurately as 0.043. However, an additional advantage is that the 95% CI for the difference between the anaesthesia regimes is provided as 0.02 to 1.38 VAS units.

An alternative approach to the analysis of a matched design is to fit a linear model using a regression command (see Figure 8.11), but in this case there is no x variable, only Diff (the y variable), so the command is simply (regress Diff). The final row of Figure 13.3b gives the corresponding output, which replicates the earlier analysis.

(a) Command

ttest Diff == 0

Output

One-sample t test
---------------------------------------------------------
Variable       | Obs   Mean   SE       (95% CI)
---------------+-----------------------------------------
(PF−LE): Diff  | 40    0.70   0.3344   0.0237 to 1.3763
---------------------------------------------------------
mean = mean(Diff), t = 2.0936
Ho: mean = 0, degrees of freedom = 39
Ha: mean != 0, Pr(|T| > |t|) = 0.043

(b) Alternative command

regress Diff

Analysis of Variance (ANOVA)
---------------------------------
Source    | SS      df   MS
----------+----------------------
Model     | 0
Residual  | 174.4   39   4.4718
----------+----------------------
Total     | 174.4   39   4.4718     SD = 2.1147
--------------------------------------------------------------
Diff      | Coef   SE       t      P>|t|   (95% CI)
----------+---------------------------------------------------
PF–LE     | 0.70   0.3344   2.09   0.043   0.0237 to 1.3763
--------------------------------------------------------------

Figure 13.3 Edited commands and output for analysing a randomised trial, with a pair-matched design, comparing two types of anaesthesia for bilateral blepharoplasty of the upper eyelids. Source: Data from Pool, Struys and van der Lei (2015).
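The hand calculation of Example 13.3 can be retraced with a few lines of code. The following is only a minimal sketch in Python (not the package used for the figures), working from the summary values quoted in the text; the variable names are illustrative assumptions, the p-value uses the large-sample Normal approximation, and the 95% CI uses the tabulated t value of 2.023 for df = 39 quoted from Table T4.

```python
from math import sqrt
from statistics import NormalDist

# Summary values quoted in Example 13.3 (Pool, Struys and van der Lei, 2015)
n_pairs = 40        # number of paired differences (PF - LE)
mean_diff = 0.70    # mean of the paired differences
sd_diff = 2.1147    # SD of the paired differences

# Standard error of the mean difference and the paired t statistic, Equation (13.1)
se_diff = sd_diff / sqrt(n_pairs)
t_stat = mean_diff / se_diff

# Large-sample (Normal) approximation to the 2-sided p-value, as with Table T2
p_normal = 2 * (1 - NormalDist().cdf(t_stat))

# 95% CI using the tabulated t value for df = 39 quoted in the text
t_crit = 2.023
ci_low = mean_diff - t_crit * se_diff
ci_high = mean_diff + t_crit * se_diff

print(f"SE = {se_diff:.4f}, t = {t_stat:.2f}, Normal p = {p_normal:.3f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

Because t is not rounded to 2.09 before the Normal tail area is taken, the approximate p-value may differ from the tabulated 0.037 in the third decimal place.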

Example 13.4

Anaesthesia for deformed eyelid surgery

Pool, Struys and van der Lei (2015, Table 1) recorded the age of the patients who were recruited to their trial and we assume, for the purposes of illustration, that age should be accounted for in the analysis. Thus, we suppose that pain levels experienced are known to differ with age so that, if we do not take age into account, a false impression of the magnitude of any difference between the anaesthesia options may result. To adjust for age, we extend the regression model previously described, after writing Diff as y for convenience, from y = β0 to y = β0 + βAgey × Agey. This model is fitted using the command (regress Diff Agey). The corresponding command and output are described in Figure 13.4.

Commands

tabstat Agey, by(Seq) stat(n min mean max) row
regress Diff Agey

Output

--------------------------------------
Sequence  | N    mean     min   max
----------+---------------------------
PF-LE     | 20   55.50    39    70
LE-PF     | 20   55.45    43    82
----------+---------------------------
Total     | 40   55.475   39    82
--------------------------------------

Analysis of Variance (ANOVA)
----------+--------------------------------------------
Source    | SS        df   MS       Number of obs = 40
----------+--------------------------   F(1, 38) = 1.65
Model     | 7.246     1    7.246        Prob > F = 0.21
Residual  | 167.154   38   4.399
----------+--------------------------------------------
Total     | 174.400   39
---------------------------------------------------------------
Diff      | Coef      SE       t      P>|t|   (95% CI)
----------+----------------------------------------------------
cons      | -1.8014   1.9769
Agey      |  0.0451   0.0351   1.28   0.21    -0.0260 to 0.1162
----------+----------------------------------------------------

Figure 13.4 Edited commands and output for analysing a matched-pair randomised trial in patients with bilateral blepharoplasty of the upper eyelids, comparing two types of anaesthesia adjusted for patient age. Source: Data from Pool, Struys and van der Lei (2015).

The regression coefficient estimates are b0 = −1.8014 and bAge = 0.0451. The latter is not statistically significant since t = 0.0451/0.0351 = 1.28 based on df = 38 and has p-value = 0.21. Thus, we conclude that age has little influence on the comparison between treatments so that the earlier (and simpler) model of Figure 13.3b with b0 = 0.70 alone is indicated.

The Analysis of Variance (ANOVA) tables of Figures 13.3b and 13.4, although not essential features of the analyses just described, do become very useful in situations when one is comparing more than two interventions, such as in the designs we described in Chapter 12. ANOVA essentially partitions the total variation between individual observations into component parts, individual parts being attributed to the influence of the variables under consideration (only Agey in Figure 13.4) and the remainder to the residual or random error variation. Thus, in the ANOVA the total variation of 174.400 is split into 7.246 due to age and the remainder, 167.154, which is termed residual and which in this example is the most substantial part. The corresponding variances are calculated by dividing these by the respective degrees of freedom, df = 1 and 38, to give 7.246/1 = 7.246 and 167.154/38 = 4.399. The Fisher F-test then compares these by their ratio, so that F(1, 38) = 7.246/4.399 = 1.65 with corresponding p-value = 0.21. Had the p-value been smaller than (say) 0.05 (which is clearly not the case here), the null hypothesis of no effect of age (on the observed differences between the anaesthesia options compared in the matched-pair trial) would be rejected and Agey would be retained in the model describing the trial results. In that case, the final regression model chosen to describe the outcome would be Diff = −1.8014 + 0.0451 Agey. For example, this model would imply that for a patient of age 40 years, Diff40 = −1.8014 + (0.0451 × 40) = 0.0026 units while for one of age 60, Diff60 = 0.9046 units. However, if the mean age of 55.475 years is substituted in this equation then DiffMean = 0.7000 units, which is the same estimate as that given in Figure 13.3a. The same format of ANOVA is given in Figure 13.3b but, for this, the model Sum of Squares, SS = 0, so that the total variation of 174.4 cannot be partitioned in this case. Essentially all the variation is then regarded as residual or random error.
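The arithmetic behind the ANOVA table and the fitted line can be checked directly. The sketch below is in Python and uses only the sums of squares and rounded coefficients printed in Figure 13.4; the function name predicted_diff is a hypothetical helper introduced for illustration.

```python
# Sums of squares and degrees of freedom from the ANOVA table of Figure 13.4
ss_model, df_model = 7.246, 1
ss_residual, df_residual = 167.154, 38

# Mean squares (variances) and the F ratio comparing them
ms_model = ss_model / df_model
ms_residual = ss_residual / df_residual
f_ratio = ms_model / ms_residual

# Fitted regression coefficients from Figure 13.4 (rounded as printed)
b0, b_age = -1.8014, 0.0451


def predicted_diff(age_years):
    """Predicted PF - LE difference in VAS for a patient of the given age."""
    return b0 + b_age * age_years


print(f"MS residual = {ms_residual:.3f}, F(1, 38) = {f_ratio:.2f}")
for age in (40, 60, 55.475):
    print(f"age {age}: predicted Diff = {predicted_diff(age):.4f}")
```

With the coefficients rounded to four decimal places, the prediction at the mean age agrees with 0.70 to about three decimal places rather than exactly.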
13.1.2.2 Difference in proportions

If the endpoint is binary, then the cross-over trial results can be summarised in the format of Figure 13.5. This has a similar format to Figure 9.3 but is repeated here for convenience. As we have indicated when describing the Equi, Balfour-Lynn, Bush and Rosenthal (2002) trial in Example 13.2, culture specimens from the patients are judged as to whether they are negative or positive with respect to the growth of P. aeruginosa during the two treatment periods.

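Grew P. aeruginosa        Grew P. aeruginosa culture with Azithromycin
culture with Placebo      Negative    Positive    Total     Anticipated proportions
--------------------------------------------------------------------------------
Negative                  e           f           e + f     1 − πP
Positive                  g           h           g + h     πP
--------------------------------------------------------------------------------
Total                     e + g       f + h       NPairs
Anticipated proportions   1 − πA      πA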

Figure 13.5 Notation for a matched-pair design comparing the numbers of patients with cystic fibrosis whose cultures grew P. aeruginosa after receiving placebo (P) and after receiving azithromycin (A)

In Figure 13.5 the letter e, for example, represents the number of patients whose culture was negative with both azithromycin (A) and placebo (P). The difference between the proportions that are culture positive with A and with P is estimated by

$$d = \frac{f + h}{N_{\text{Pairs}}} - \frac{g + h}{N_{\text{Pairs}}} = \frac{f - g}{N_{\text{Pairs}}} \tag{13.2}$$

Alternatively, the data arising from such designs are sometimes summarised using the odds ratio, calculated as ψ = g/f. Thus, ψ is a measure of how much more likely it is that a patient will be culture positive with P as opposed to when receiving A. We note that the (e + h) patients who respond to both P and A in the same way (that is, they are either negative with both treatments or positive with both treatments) do not enter this calculation. The corresponding expressions for the exact confidence intervals for ψ are complex but are usually an integral part of the output of the statistical packages used for the analysis. For the paired interventions, the null hypothesis implies that f and g are expected to be equal given that there is a total of f + g discordant pairs. In large samples this leads to the McNemar test

$$z = \frac{f - g}{\sqrt{f + g}} \tag{13.3}$$

Example 13.5

Culture of P. aeruginosa in cystic fibrosis

The data from Equi, Balfour-Lynn, Bush and Rosenthal (2002) are tabulated, following the command (tabulate Placebo Azithromycin), in Figure 13.6, from which d = (f − g)/NPairs = (6 − 7)/41 = −0.0244 and ψ = 7/6 = 1.17. Equation (13.3) gives z = (6 − 7)/√(6 + 7) = −0.28. Use of Table T2 gives the p-value = 2(1 − 0.61026) = 0.78, which is clearly not statistically significant. We should note that NPairs = 41 is not an even number: 20 patients were assigned the sequence AP but 21 PA. The corresponding command (mcc Placebo Azithromycin) and the edited output are given in Figure 13.6. We stated that Equation (13.3) was really only for large samples, which this example clearly is not. This accounts for the discrepancy between the p-value = 0.78 obtained in our calculation and that of the ‘exact’ result with a value of 1. For small samples, a modification to Equation (13.3) can be made by reducing the absolute value (that is, ignoring a minus sign if the difference is negative) of the numerator by 1. This leads in our example to z = [abs(−1) − 1]/√13 = 0.0 and so Table T2 gives the p-value = 2(1 − 0.5) = 1, which then equals the exact value.

Commands

tabulate Placebo Azithromycin
mcc Placebo Azithromycin

Edited Output

-----------+----------------------+-------
           |     Azithromycin     |
Placebo    | Negative    Positive | Total
-----------+----------------------+-------
Negative   |    17           6    |   23
Positive   |     7          11    |   18
-----------+----------------------+-------
Total      |    24          17    |   41
-----------+----------------------+-------
McNemar's chi2(1) = 0.08     Prob > chi2 = 0.7815
Exact McNemar significance probability = 1

Proportion with factor
  Placebo        0.4390
  Azithromycin   0.4146
-----------------------------------------
                        (95% CI)
-----------------------------------------
difference   0.0244    -0.1722 to 0.2210
odds ratio   1.1667     0.3357 to 4.2020
-----------------------------------------

Figure 13.6 Edited commands and output from a statistical package for analysing a matched-pair design in patients with cystic fibrosis. Source: Data from Equi, Balfour-Lynn, Bush, et al. (2002).
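The calculations of Example 13.5 can be sketched in Python using only the discordant counts f and g. The exact p-value below is obtained by treating f as Binomial(f + g, 0.5) under the null hypothesis and doubling the smaller tail; this is a common way of computing an exact McNemar probability and is assumed here to correspond to the ‘exact’ value reported by the package. The helper name two_sided_normal_p is illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

# Discordant counts in the notation of Figure 13.5, taken from Figure 13.6
f = 6   # culture negative with placebo, positive with azithromycin
g = 7   # culture positive with placebo, negative with azithromycin
n_pairs = 41

diff = (f - g) / n_pairs      # Equation (13.2)
odds_ratio = g / f            # psi = g / f


def two_sided_normal_p(z):
    """2-sided p-value from the standard Normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z_large = (f - g) / sqrt(f + g)             # Equation (13.3)
z_small = (abs(f - g) - 1) / sqrt(f + g)    # small-sample modification

# Exact test: under the null hypothesis f follows Binomial(f + g, 0.5)
n_disc = f + g
lower_tail = sum(comb(n_disc, k) for k in range(min(f, g) + 1)) / 2 ** n_disc
p_exact = min(1.0, 2 * lower_tail)

print(f"d = {diff:.4f}, psi = {odds_ratio:.2f}")
print(f"large-sample p = {two_sided_normal_p(z_large):.4f}")
print(f"modified p = {two_sided_normal_p(z_small):.2f}, exact p = {p_exact:.2f}")
```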

13.1.3 Trial size

13.1.3.1 Continuous outcome

When considering a matched-pair design, there are several sources of variation that one has to consider, and also any relationship there is between them. The within-subject standard deviation, σWithin, quantifies the anticipated variation among repeated measurements on the same individual, irrespective of the treatment received – perhaps the resting blood pressure of a healthy individual recorded several times in 1 day. It is a compound of true variation in the individual and any measurement error. In an unmatched design, subjects will only receive one of the interventions, either S or T, with corresponding continuous observations either yS or yT, and associated standard deviations either SD(yS) or SD(yT). These two standard deviations quantify the between-subject variation of those receiving S and T respectively and are independent of each other. In contrast, for a matched-pair design with the two sequences ST and TS, each subject recruited contributes a pair of observations, yS and yT, and so contributes to both SD(yS) and SD(yT), which are therefore not independent of each other. However, the analysis of a paired design focuses on the differences, d = yS − yT, with standard deviation SD(d). Elbourne, Altman, Higgins, et al. (2002, p. 149, Equation (A)) provide the following relationship between the standard deviations of observations made on a patient receiving both S and T, and that of their difference:

$$[\mathrm{SD}(d)]^2 = [\mathrm{SD}(y_S)]^2 + [\mathrm{SD}(y_T)]^2 - 2\rho\,\mathrm{SD}(y_S)\,\mathrm{SD}(y_T) \tag{13.4}$$

Here ρ, which must take a value between −1 and +1, is the correlation between the yS and yT outcomes calculated from all N = 2m subject pairs. In the common case, where SD(yS) and SD(yT) are assumed to be equal, and we label these as σBetween, then Equation (13.4) becomes

$$\sigma_{\text{Diff}}^2 = \sigma_{\text{Between}}^2 + \sigma_{\text{Between}}^2 - 2\rho\,\sigma_{\text{Between}}^2 = 2(1 - \rho)\,\sigma_{\text{Between}}^2 \tag{13.5}$$

We note that Equation (13.5) implies that σ Diff < σ Between provided ρ > 0.5 and this will usually be the case as experience suggests that ρ is often between 0.60 and 0.75 in this type of trial. When designing a matched-pair trial we will need to estimate the patient numbers required. This implies specifying the anticipated difference between the interventions, δPlan, and the anticipated value of SD(d), denoted σ Diff-Plan. A pragmatic way to obtain σ Diff-Plan is to postulate the range of values that the differences, di, are likely to take, and divide this range by four. Alternatively, if an anticipated value of σBetween is available then, for a given ρ, Equation (13.5) can be used. In which case, as ρ increases, σ Diff decreases and so would the chosen σ Diff-Plan. In general, a smaller σ Diff-Plan leads to a smaller sample size.

The number of pairs required for a matched-pair design with 2-sided test size α and power 1 − β is estimated using an adaptation of Equation (9.5) to give

$$N_{\text{Pairs}} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\left(\delta_{\text{Plan}}/\sigma_{\text{DiffPlan}}\right)^2} + \frac{z_{1-\alpha/2}^2}{2} \tag{13.6a}$$

In practice, it may be that only Cohen’s standardised effect size is stipulated, so that ΔPlan = δPlan/σDiffPlan is used in Equation (13.6a). Julious, Campbell and Altman (1999) point out that in many instances σ²Diff = 2σ²Within but further ‘Note that it is not always clear in the literature whether the statistic given (when planning sample sizes) is σDiff or σWithin.’ For this reason, we reproduce the alternative expression for sample size here:

$$N_{\text{Pairs}} = \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\left(\delta_{\text{Plan}}/\sigma_{\text{WithinPlan}}\right)^2} + \frac{z_{1-\alpha/2}^2}{2} \tag{13.6b}$$

In this latter case, the denominator in the first component of this expression is no longer the standardised effect size, ΔCohen.

Example 13.6

Anaesthesia for bilateral blepharoplasty of upper eyelids

As we have seen in Example 13.3, the data provided by Pool, Struys and van der Lei (2015, Table 1) give for PF a SDPF = 2.2290 and for LE a SDLE = 2.0877 units, which are very similar. Thus, it is reasonable to estimate σBetween by √[(2.2290² + 2.0877²)/2] = 2.1595 and so Equation (13.5) leads to σDiff-Plan = √[2(1 − ρ)] × 2.1595 or 3.0540 × √(1 − ρ). If a confirmatory trial is proposed then, using the information from this trial as a basis for planning, a value for δPlan = 0.7 might be stipulated. Assuming a 2-sided test size of 5% and power 80%, use of Table T3 and Equation (13.6a) gives NPairs = (1.96 + 0.8416)²/[0.7/(3.054 × √(1 − ρ))]² + 1.96²/2 = 149.40(1 − ρ) + 1.92. Assuming ρ is between 0.45 and 0.75, these two extremes imply NPairs is likely to be between 84.09 and 39.27. Cautious investigators might therefore recruit (say) 80 patients to the new trial with 40 patients per sequence LE–PF and PF–LE. In fact, the correlation of VAS between the LE and PF treatments in the 40 patients included in the trial is r = 0.5217. Assuming this value for ρPlan allows Equation (13.5) to be used to obtain σDiff-Plan = 2.1121 and Equation (13.6a) gives NPairs = 74.
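The way NPairs changes with ρ in Example 13.6 can be explored with a short function. The following Python sketch implements Equations (13.5) and (13.6a) as reconstructed from the text; the function and argument names are assumptions made for illustration, and the Normal quantiles are computed rather than read from Table T3, so results may differ from the hand calculation in the second decimal place.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sd_diff_from_between(sd_between, rho):
    """SD of the paired differences from Equation (13.5)."""
    return sqrt(2 * (1 - rho)) * sd_between


def n_pairs_continuous(delta_plan, sd_diff_plan, alpha=0.05, power=0.80):
    """Number of pairs from Equation (13.6a) for a 2-sided test of size alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    effect = delta_plan / sd_diff_plan   # standardised effect size
    return (z_alpha + z_beta) ** 2 / effect ** 2 + z_alpha ** 2 / 2


# Planning values of Example 13.6
sd_between = sqrt((2.2290 ** 2 + 2.0877 ** 2) / 2)
for rho in (0.45, 0.5217, 0.75):
    sd_diff = sd_diff_from_between(sd_between, rho)
    n = n_pairs_continuous(0.7, sd_diff)
    print(f"rho = {rho}: sigma_diff = {sd_diff:.4f}, N_pairs = {n:.2f} -> {ceil(n)}")
```

Rounding up to the next whole pair (and then to an even number, if equal numbers per sequence are wanted) is left to the user.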

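                        Test (T)
Standard (S)            Failure (0)    Success (1)    Marginal proportions
--------------------------------------------------------------------------
Failure (0)             π00            π01            1 − πS
Success (1)             π10            π11            πS
--------------------------------------------------------------------------
Marginal proportions    1 − πT         πT             1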

Figure 13.7 General notation for a paired 2 × 2 contingency table comparing a Test with a Standard treatment

13.1.3.2 Binary outcome

The general notation for a paired 2 × 2 contingency table for comparing two treatments with a binary response variable is set out in Figure 13.7. The π00, π01, π10 and π11 represent the four probabilities for the binary pairs (0, 0), (0, 1), (1, 0) and (1, 1) respectively, while πS and πT are the anticipated response rates with S and T. With this notation the proportion of discordant pairs is πDisc = π01 + π10. StataCorp (2019) defines the relationship between the marginal proportions πS and πT in Figure 13.7 and the individual cell proportions π01 and π10 as follows:

$$\pi_{01} = \pi_S(1 - \pi_T) - \rho_B\sqrt{\pi_S(1 - \pi_S)\,\pi_T(1 - \pi_T)} \quad\text{and}\quad \pi_{10} = \pi_{01} + \pi_T - \pi_S \tag{13.7}$$

where ρB is the correlation between the paired binary observations. Hence,

$$\pi_{\text{Diff}} = \pi_{10} - \pi_{01} \quad\text{and}\quad \pi_{\text{Disc}} = \pi_{10} + \pi_{01} \tag{13.8}$$

Equation (13.8) then leads indirectly to the formulation of the sample size derived from prespecified values for πS, πT and ρB as

$$N_{\text{Pairs}} = \left[\frac{z_{1-\alpha/2}\sqrt{\pi_{\text{Disc}}} + z_{1-\beta}\sqrt{\pi_{\text{Disc}} - \pi_{\text{Diff}}^2}}{\pi_{\text{Diff}}}\right]^2 \tag{13.9}$$

Example 13.7

Cystic fibrosis

Suppose the trial of Equi, Balfour-Lynn, Bush and Rosenthal (2002) was to be repeated but with a different formulation of Azithromycin (A) which was thought to be much more effective but, once again, to be compared with Placebo (P) in a matched-pair design involving the sequences PA and AP. Using the results of Figure 13.2, the observed proportion with a positive sample with P is πP = 18/41 = 0.44. For the new formulation of A, it is anticipated that πA = 0.34. Suppose a 2-sided test size α = 0.05 and a power 1 − β = 0.8 are required, giving from Table T3, z0.975 = 1.96 and z0.80 = 0.8416. However, the investigators are not sure with respect to the specification for ρB and so they choose to investigate a range of possibilities. First assuming ρB = 0.5, use of Equation (13.7) gives planning values π01 = 0.1728 and π10 = 0.0728, from which Equation (13.8) gives πDiff = −0.10 (a magnitude of 0.10, which is all that affects Equation (13.9)) and πDisc = 0.2457. Finally, Equation (13.9) gives NPairs = {[1.96√0.2457 + 0.8416√(0.2457 − 0.10²)]/0.10}² = 190.44 or about 200 with 100 subjects assigned to each sequence PA and AP. When examining how NPairs changes with changing ρB, they find that values greater than ρB = 0.8 are not compatible with the planning assumptions πP = 0.44 and πA = 0.34. In contrast, as ρB decreases from 0.8 to zero, the sample size required increases from NPairs = 80 to 376. A pragmatic choice in this situation may be to take a sample size between these two extremes; for example, when ρB = 0.35, NPairs = 246 so that 250 pairs may be recruited.
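The sensitivity of NPairs to ρB described in Example 13.7 can be sketched as follows. This Python function chains Equations (13.7), (13.8) and (13.9) as written above; the function name, argument names and the choice to raise an error when a cell proportion becomes negative are assumptions for illustration, not part of any package.

```python
from math import sqrt
from statistics import NormalDist


def n_pairs_binary(pi_s, pi_t, rho_b, alpha=0.05, power=0.80):
    """Number of pairs for a paired binary outcome, Equations (13.7) to (13.9)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Equation (13.7): discordant cell proportions from the margins and rho_b
    pi_01 = pi_s * (1 - pi_t) - rho_b * sqrt(pi_s * (1 - pi_s) * pi_t * (1 - pi_t))
    pi_10 = pi_01 + pi_t - pi_s
    if pi_01 < 0 or pi_10 < 0:
        raise ValueError("rho_b is not compatible with the planning proportions")
    # Equation (13.8)
    pi_diff = pi_10 - pi_01
    pi_disc = pi_10 + pi_01
    # Equation (13.9); the sign of pi_diff is removed by the squaring
    numerator = z_alpha * sqrt(pi_disc) + z_beta * sqrt(pi_disc - pi_diff ** 2)
    return (numerator / pi_diff) ** 2


# Planning values of Example 13.7: pi_P = 0.44, pi_A = 0.34
for rho_b in (0.0, 0.35, 0.5, 0.8):
    print(f"rho_B = {rho_b}: N_pairs = {n_pairs_binary(0.44, 0.34, rho_b):.2f}")
```

Values of ρB a little above 0.8 make π10 negative for these margins, which is the incompatibility noted in the example.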

13.2 Cross-over trials

13.2.1 Design

In certain situations, if two interventions are to be compared consecutively in the same subjects and, after giving the first component of the sequence, an interval of time is introduced before administering the second, this is termed a cross-over trial. In such circumstances, double-blind trials are particularly recommended.

13.2.1.1 Two-period – two-treatment

In the case of a cross-over trial of the design of Figure 13.8, comparing a drug A with a placebo P, one of them, say P, is given to a group of patients and then, sometime later in a second period, the same patients are all challenged with the drug A. Conversely, other patients receive A in the first period and then subsequently receive P in the second. Thus, although each patient receives both of the treatments, some receive these in the order PA and some in the reverse order AP. In this two-treatment two-period cross-over trial, the participants are therefore randomised to one of the sequences PA and AP. It is usual to recruit an equal number of participants, n, to each sequence and we assume this in what follows.

13.2.1.2 Washout to reduce carry-over

Now it is evident that those patients randomised to receive the sequence AP receive P (in Period II) after they have previously received A in Period I. Thus, any residual or carry-over effect of A within the patient could influence the subsequent effect of P within that patient. If it does, one is then really comparing A with (P after A), rather than A with P. On a similar basis with the other sequence we will be comparing P with (A after P). This is a situation we wish to avoid, and consequently a ‘wash-out’ period following Period I is usually introduced before the intervention of Period II. The wash-out is intended to avoid the potential contamination of the treatment given in the first period upon the outcome in the second. The length of the washout period will depend on, for example, how long the drug remains active within the individual. If it has a transient effect on the endpoint of concern, and thereafter is soon eliminated from the body, then it may be presumed that the wash-out period can be relatively short and that there is little chance of a carry-over effect. In contrast, if the drugs are more likely to be excreted over a longer time, then an extended wash-out period will be required. It is this feature, the presence of a suitable washout period, which distinguishes a cross-over trial from that described earlier as a matched-pair design.

Example 13.8

Cross-over trial – Anacetrapib and blood pressure

Krishna, Anderson, Bergman, et al. (2007) describe a randomised placebo (P) controlled, two-period cross-over trial of anacetrapib (A) in healthy volunteers and their trial design is summarised in Figure 13.8. During the course of the trial, 24-h ambulatory blood pressure was monitored on day 10 of each treatment period. The trial participants and investigators were blinded to the order in which the trial medication was administered.

[Schematic: Healthy volunteers → Run-in → Random allocation to sequence of treatments. One sequence receives Anacetrapib (A), then Washout, then Placebo (P); the other receives Placebo (P), then Washout, then Anacetrapib (A).]

Figure 13.8 Randomised placebo controlled, two-period cross-over trial of anacetrapib in healthy volunteers. Source: Based on Krishna, Anderson, Bergman, et al. (2007).

13.2.1.3 Run-in

In some cases, as in Figure 13.8, a ‘run-in’ period may also be a feature of a cross-over design. In general, such trials tend to recruit subjects for which there are extensive eligibility requirements that may entail detailed clinical and/or laboratory investigation before eligibility is confirmed. Thus, the purpose of the run-in period will usually be not only to screen potential participants to ensure their eligibility but also to allow key baseline measures to be made.

13.2.2 Difficulties

The complexity of the cross-over design, with a possible run-in, Period I, wash-out, and Period II, impacts on the type of subject that can be studied in this way, as the wash-out has to ensure not only that the drug has been completely excreted but also that any residual effect on the subject is entirely eliminated. This implies, for example, that every patient recruited will return, prior to commencing Period II treatment, to the disease state they were in before the Period I drug had been given. Thus, the disease or condition needs to be relatively stable over time within the patient concerned, so that when treatment is withdrawn (at the end of Period I) the patient's condition will return to the pretreatment state. An extreme example of when the cross-over trial should not be used is when there is a potential for cure with the Period I treatment. In such cases challenging any cured patient with the Period II intervention would be entirely inappropriate.

Drop-out is a major problem in cross-over trials, because they require considerable co-operation from the subjects. A subject who misses the second period effectively nullifies their contribution in the first period. An extreme example of this difficulty is provided in the two-period two-treatment cross-over trial conducted by Collados-Gómez, Ferrara-Camacho, Fernandez-Serrano, et al. (2018) who entered 137 pre-term infants requiring venepuncture into the first period of their design but only 66 of these infants entered the second period. One way of reducing drop-out rates is to ensure the trial is as short as possible and so a balance has to be struck between this and extending the wash-out to ensure that Period II is free of carry-over. Clearly the addition of a run-in period compounds the difficulties.

Example 13.9

Ambulatory blood pressure

In the investigation of ambulatory blood pressure changes of P or A in the trial conducted by Krishna, Anderson, Bergman, et al. (2007) there was a reasonable expectation that the healthy volunteers would return to their pre-Period I blood pressure after a suitable washout period and so no carry-over was anticipated. The authors state in their report:

There was at least a 14-day washout interval from the last dose of anacetrapib (or matching placebo) in between the treatment periods.

In this trial, there was also a run-in period of 4 days then a 10-day Period I treatment, a wash-out of a minimum of 14-days, and then a final 10-day Period II on the other treatment. Thus, the minimum time on the trial was 38 days. The purpose of the run-in was to screen volunteers for their eligibility by ensuring that they had a maximum blood pressure of less than 140/90 mmHg and a diastolic blood pressure that did not differ by more than 10 mmHg measured on days-4 and -1 prior to randomisation to the treatment sequence.

Cross-over trials are sometimes conducted without a wash-out included. One such example is that of Allan, Hays, Jensen, et al. (2008) who compared transdermal fentanyl with sustained-release oral morphine, each used for 4 weeks, for treating chronic noncancer pain. In this case the patients recruited could not be denied active pain relief, as a wash-out period would require, because this would cause unethical suffering. Further, if the 1 : 1 randomisation between sequences is either not used or is not achieved, then the statistical properties of the cross-over design are compromised to some extent. Thus, the cross-over trial of Kerley, Dolan, James and Cormican (2018) described in Example 1.8 seemingly used simple (not blocked) randomisation and obtained unequal allocation, with 13 patients to NP and 7 to PN. Neither did this trial include a washout period. Jones (2008) gives a short review of why these designs are so important in certain contexts. Further aspects of design and analysis of cross-over trials can be found in Senn (2002). Although we have only described a two-period two-treatment design, an interesting example of a three-treatment three-period cross-over trial is that of Lafeber, Grobbee, Schrover, et al. (2015) which compared a morning polypill, evening polypill and individual pills with respect to changes in LDL-cholesterol levels and ambulatory blood pressure in 81 patients with established atherosclerotic cardiovascular disease.

13.2.3 Analysis

The first stage of analysis of a two-period two-treatment cross-over trial replicates that for comparing the intervention groups which we have described for a matched-pair design. For a continuous outcome this involves the use of the paired t-test of Equation (13.1) and, for a binary endpoint, the McNemar test of Equation (13.3). As we noted earlier, one advantage of using a statistical package for analysis is that it automatically gives the precise p-values and an associated confidence interval.
Example 13.10

Fuel metabolism during exercise

Small cross-over trials are sometimes conducted, such as that described by Jenni, Oetliker, Allemann, et al. (2008) who studied fuel metabolism during an exercise of cycling for 120 minutes. The trial involved only 7 physically active men with type 1 diabetes mellitus, who were blinded to their blood glucose levels, set randomly at 5 (Euglycaemia) or 11 (Hyperglycaemia) mmol/l. The men were randomised to one of the sequences EH and HE with a 7 (range 6–10) week washout period. This trial is more a laboratory-based study than a therapeutic trial. In the corresponding analysis, df = 6 so that Table T4 would need to be used in this case. Thus, to be statistically significant at the 5% level the test statistic t would need to exceed 2.447 rather than 1.96 (the value from Table T3) and at the 1% level 3.707 rather than 2.5758 (from Table T3). Unfortunately, the authors appear to have ignored the paired nature of their design when presenting their analysis.

The second stage of the analysis of the cross-over trial is to consider the Period effect, in particular, whether this affects our view of the difference between the interventions concerned. Thus, the analysis now compares the interventions adjusted for any period effect. In essence, the analysis looks at the mean d̄I−II and standard deviation SDI–II of the differences between the observations of Periods I and II within sequence ST and the mean d̄II−I and SDII–I of the differences between Periods II and I within sequence TS. The period-adjusted mean difference of the interventions is therefore

$$\bar{d}_{\text{Adjusted}} = \frac{\bar{d}_{I-II} + \bar{d}_{II-I}}{2} \tag{13.10}$$

with

$$\mathrm{SE}\left(\bar{d}_{\text{Adjusted}}\right) = \frac{1}{2}\sqrt{\left(\frac{SD_{I-II}^2 + SD_{II-I}^2}{2}\right)\left(\frac{1}{n} + \frac{1}{n}\right)} \tag{13.11}$$

In the situation here, where n subjects are allocated to each sequence, d̄Adjusted of Equation (13.10) will equal the d̄ which forms part of Equation (13.1). However, in general, Equation (13.11) will not equal SE(d̄) of Equation (13.1).

Example 13.11

Anaesthesia for bilateral blepharoplasty of upper eyelids

The matched-pair trial of Pool, Struys and van der Lei (2015) is not a cross-over trial in the sense of Figure 13.8, as it contains no wash-out period. However, as one eyelid is operated on immediately after the other, it rather mimics the cross-over situation. Also, although every patient receives both types of anaesthesia, they are given to different eyelids. Nevertheless, as the authors provide the full outcome data, we are able to use their trial to illustrate the analysis of a cross-over trial as though it contains an appropriate washout.

From Figure 13.9c the pooled estimate is SDPool = √[(2.6737 + 2.9474)/2] = 1.6765 and hence, from Equation (13.11), SE(DiffPeriod) = ½ × 1.6765 × √(1/20 + 1/20) = 0.2651.

The adjusted t-test in the modified format of Equation (13.1) gives t = 0.7/0.2651 = 2.64. From Table T2 the corresponding p-value = 2(1 − 0.99585) = 0.0083. The 95% CI for the period adjusted treatment effect is 0.7 − (1.96 × 0.2651) = 0.180 to 0.7 + (1.96 × 0.2651) = 1.220, which is narrower than the unadjusted 95% CI of 0.024 to 1.376 that was given in Figure 13.3a.

(a) Test for Comparing Treatments

ttest Treat==0

One-sample t test
-------------------------------------------------
Variable | Obs   Mean   SE       (95% CI)
---------+---------------------------------------
PF - LE  | 40    0.7    0.3344   0.024 to 1.376
-------------------------------------------------
mean = mean(Treat), t = 2.0936
Ho: mean = 0, degrees of freedom = 39
Ha: mean != 0, Pr(|T| > |t|) = 0.043

(b) Test for Difference between Periods I and II

ttest Period==0

One-sample t test
--------------------------------------------------
Variable | Obs   Mean   SE       (95% CI)
---------+----------------------------------------
Period   | 40    -1.3   0.2847   -1.876 to -0.724
--------------------------------------------------
mean = mean(Period), t = -4.5670
Ho: mean = 0, degrees of freedom = 39
Ha: mean != 0, Pr(|T| > |t|) = 0.000048

(c) Treatment difference within each Period

------+------------------------------
      |       Mean      Variance
Seq   | N     (PF-LE)   (PF-LE)
------+------------------------------
PL    | 20    -0.6      2.6737 = SD²I−II
LP    | 20     2.0      2.9474 = SD²II−I
------+------------------------------
Treatment difference adjusted for Period
DiffPeriod = [(−0.6) + 2.0]/2 = 0.7

Figure 13.9 Illustrative example of the analysis of a cross-over trial. Source: Data from Pool, Struys and van der Lei (2015).
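The period-adjusted calculation of Example 13.11 can be written out as a brief Python sketch of Equations (13.10) and (13.11), using the within-sequence means and variances of Figure 13.9c. As in the text, the p-value and CI use the Normal approximation (Table T2 and 1.96); the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

# Within-sequence summaries of the PF - LE differences from Figure 13.9c
n = 20                               # subjects per sequence
mean_seq1, var_seq1 = -0.6, 2.6737   # sequence PL
mean_seq2, var_seq2 = 2.0, 2.9474    # sequence LP

# Equation (13.10): period-adjusted mean treatment difference
d_adjusted = (mean_seq1 + mean_seq2) / 2

# Equation (13.11): pooled SD, then the standard error of the adjusted difference
sd_pool = sqrt((var_seq1 + var_seq2) / 2)
se_adjusted = 0.5 * sd_pool * sqrt(1 / n + 1 / n)

# Test statistic, Normal-approximation p-value and 95% CI
z = d_adjusted / se_adjusted
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
ci_low = d_adjusted - 1.96 * se_adjusted
ci_high = d_adjusted + 1.96 * se_adjusted

print(f"adjusted difference = {d_adjusted:.2f}, SE = {se_adjusted:.4f}")
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
```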

13.2.4 Trial Size

In brief, as the cross-over trial is similar in structure to the paired designs considered earlier, sample size calculations make use of Equation (13.6a) for continuous outcomes. It is also important to note that if Equation (13.6b) is to be used then planning values for σWithin are required – see Equations (13.4) and (13.5). For a binary endpoint, sample sizes are obtained from Equation (13.9).

Example 13.12

Cross-over trial size – ambulatory blood pressure

At the planning stage of the trial of Figure 13.8, which was eventually conducted by Krishna, Anderson, Bergman, et al. (2007), the investigators assumed a within-subject standard deviation for average 24-h systolic blood pressure of σWithin-Plan = 5.7 mmHg and an anticipated difference of δPlan = 6 mmHg between that of P and A. Assuming a 2-sided test size of 5% and a power of 95%, use of Table T3 and Equation (13.6b) leads to NPairs = 2(1.96 + 1.6449)²/(6/5.7)² + 1.96²/2 = 25.38. This suggests 26 healthy volunteers should be recruited with n = 13 randomised to each treatment sequence. In fact, the investigators used a 1-sided test so that Equation (13.6b) then gives NPairs = 20.89, implying n = 11 per sequence.
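The two sample sizes quoted in Example 13.12 can be reproduced with a small function for Equation (13.6b). In this Python sketch the sides argument is an assumption introduced to switch between the 2-sided and 1-sided versions by replacing z(1 − α/2) with z(1 − α); the function name is likewise illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_pairs_within(delta_plan, sd_within_plan, alpha=0.05, power=0.80, sides=2):
    """Number of pairs from Equation (13.6b), based on the within-subject SD."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / sides)
    z_beta = NormalDist().inv_cdf(power)
    ratio = delta_plan / sd_within_plan
    return 2 * (z_alpha + z_beta) ** 2 / ratio ** 2 + z_alpha ** 2 / 2


# Planning values of Example 13.12: delta = 6 mmHg, sigma_within = 5.7 mmHg
n_two_sided = n_pairs_within(6, 5.7, alpha=0.05, power=0.95, sides=2)
n_one_sided = n_pairs_within(6, 5.7, alpha=0.05, power=0.95, sides=1)

print(f"2-sided: {n_two_sided:.2f} -> {ceil(n_two_sided)} pairs")
print(f"1-sided: {n_one_sided:.2f} -> {ceil(n_one_sided)} pairs")
```

Rounding up to an even number then gives equal numbers per sequence, as in the example.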

13.3 Split-mouth designs

13.3.1 Design In a typical split-mouth design, although not confined to dental applications alone, one location in the mouth receives one of the interventions and another location the alternative. In a sense, the two locations replace Periods I and II in the cross-over design although there is now no temporal component. Thus, a dental trial may find an eligible tooth on the left lower jaw, and match it with an equally eligible tooth on the right. The corresponding randomisation essentially allocates one of the interventions (say the standard restorative approach) to the left or right side and then the other intervention (the test) is given to the designated tooth on the opposite side of the mouth. One might expect the allocation to left and right to be blocked so that a 1 : 1 ratio over all the patients eventually recruited to the trial is maintained. The unit for the eventual analysis, and hence for planning purposes, is the difference in outcome measure between the two (paired) teeth. We described the trial of Pozzi, Agliardi, Tallarico and Barlattani (2012) in Example 1.10 which used the split-mouth design when comparing dental implants. This design is evidently very similar to that of the cross-over trial except there is no equivalent to the ‘periods’ or to the ‘wash-out’ although a carry-over (now termed carry-across) effect is


likely to be present. Since there are no ‘periods’ it is not necessary to have the treatment sequences (say AB and BA) in equal numbers across the total number of patient mouths involved. In certain applications, there may be several suitable matched pairs within one individual’s mouth.

13.3.2 Difficulties In principle, split-mouth designs could be applied to the clinical situation of oral lichen planus described by Poon, Goh, Kim, et al. (2006) of Example 2.1, although with topical treatments being applied to each side of the mouth serious problems with carry-across effects would certainly ensue. This possibility would clearly have excluded this design option for that trial. The difficulties concerned with the use of split-mouth designs have been set out very clearly by Hujoel (1998) and we have based this section closely on that article. The author points out that the four major difficulties are associated with recruitment, possible bias, statistical efficiency and complexity of subsequent analysis. Clearly whatever the design chosen no clinical trial can be conducted without finding suitable patients with the condition in question. This requires precise eligibility criteria to be applied so that in the situation of dentition, eligible teeth have to be identified. However, for the split-mouth design if one tooth is found with the necessary characteristics, a second of similar condition also has to be located. Thus, for a patient to be eligible, a pair of matched teeth has to be identified.

Example 13.13

Restoratives for dental caries

The trial conducted by Lo, Luo, Fan and Wei (2001) compared two restoratives for dental treatment. They targeted school children in the age range of 6–14 years inclusive. However, amongst the 1327 pupils identified, only 89 who had one or two bilateral matched pairs of carious posterior teeth that required either class I or II restorations were selected as eligible. Thus 93% of the children examined were not eligible for the trial. In this case, identifying the children to examine would be straightforward, but determining which of these are eligible for the trial and which not must have been a very time consuming and resource-intensive business. As Hujoel (1998) states: The more complex the entry criteria, the more difficult the recruitment, and the more questions regarding generalizability (of the results) may arise.

The bias in the design arises as the effects of the treatments given to one location of the mouth have the potential to carry-across to the other location. Thus, the final comparison within the mouth is of A (with B given in the other side) versus B (with A given


on the other side). So, the comparison is eventually A(B)–B(A) which may give a biased view of the true difference A–B. Such a bias may magnify, have no effect on, or reduce the apparent difference, but which of these happens is impossible to determine. This means that even if the difficulties of recruitment can be overcome, there still remains an assessment by the design team of whether or not the ‘carry-across’ effect can be ignored or at least be regarded as minimal. The within-patient allocation of units does have the potential to increase the precision of the estimate of the difference between interventions. However, the increase in precision is directly related to the within-patient correlation coefficient, ρWithin, of the intervention-specific responses within patients. If ρWithin is large and positive then this leads to fewer patients being required than if the same planning values were used for a randomised two (independent) group parallel design. Conversely, if ρWithin is small, then it is the split-mouth design that requires more patients. Hujoel and Loesche (1990) suggest that when only a few sites per mouth are studied then low within-patient correlation coefficients are common in periodontal research. This contrasts with applications for cross-over trials discussed earlier where the correlations tend to be higher. They also suggest, using evidence from a caries prevention trial, that values of ρWithin between −0.17 and +0.02 are likely. Such low values provide a strong indication that a split-mouth design would not be useful in such a context.

13.3.3 Analysis Hujoel (1998) points out that the statistical analysis can be more complex although, in our view, this should not be regarded as a major obstacle to the use of these designs.
Nonetheless care at the analysis stage is certainly required so that, for example, due account is taken of the fact that the 101 bilateral matched pairs included within the trial of Lo, Luo, Fan and Wei (2001) were identified from only 89 children, 77 with a single pair and 12 with two sets of matched pairs. Nevertheless, a complex analysis may obscure the clarity of the clinical message intended and this is not a good thing. The form of analysis of the split-mouth design and the basics of sample size calculation of, for example, one matched pair of teeth per mouth, follow that of the cross-over trial although in this case the left- and right-hand cavities (if appropriate) would replace the period but wash-out cannot be used to prevent contamination.

Example 13.14

Split-mouth design – caries prevention

The results of a split-mouth design conducted by Arrow and Riordan (1995) which compared glass-ionomer cement (GIC) and a resin-based fissure sealant (RFS) for caries prevention in 352 cases are summarised in Figure 13.10. For this example we have, using Equation (13.3), z = (40 − 77)/√(40 + 77) = −3.42 and from Table T2 the p-value = 2(1 − 0.99969) = 0.0006. The corresponding estimate of the odds ratio is OR = f/g = 40/77 = 0.5195 implying that children are almost twice as likely to be caries-free with the RFS approach. However, this trial was not randomised and so the results must be viewed with extreme caution.

                          Glass-ionomer cement (GIC)
Resin-based fissure
sealant (RFS)             Caries      No caries     Total
Caries                    9 (e)       40 (f)        49
No caries                 77 (g)      226 (h)       303
Total                     86          266           352

Figure 13.10 Results from a split-mouth non-randomised design comparing Glass-Ionomer Cement (GIC) with Resin-based Fissure Sealant (RFS) for caries prevention. Source: Data from Arrow and Riordan (1995).

The associated McNemar analysis using a statistical package is given in Figure 13.11 using the command (mcc GIC RFS). The output indicates the two alternative measures for summarising the results with the associated confidence intervals. One is the difference in proportions with caries, indicating 11% (95% CI 4 to 17%) fewer with the RFS, and the other is the odds ratio of 0.52 (95% CI 0.35 to 0.77) that was calculated earlier. We note too that the p-value obtained from the test of 0.0006 is increased to 0.0008 using the more sensitive (termed exact) method of calculation. In this case, this change makes no material difference to the interpretation of the study results.

Command
mcc GIC RFS

Output
----------------------------------------------------
              | GIC                         |
RFS           | Caries         No caries    | Total
--------------+-----------------------------+-------
Caries        |      9             40       |    49
No caries     |     77            226       |   303
--------------+-----------------------------+-------
Total         |     86            266       |   352
----------------------------------------------------
McNemar's chi2(1) = 11.70, Prob > chi2 = 0.0006
Exact McNemar significance probability = 0.0008

Proportion with factor
RFS   0.1392 (49/352)
GIC   0.2443 (86/352)
----------------------------------------------------
                            (95% CI)
----------------------------------------------------
difference   -0.1051    -0.1672 to -0.0431
odds ratio    0.5195     0.3454 to  0.7708 (exact)
----------------------------------------------------

Figure 13.11 Edited commands and output from a statistical package for analysing a split-mouth design comparing treatments for caries prevention. Source: Data from Arrow and Riordan (1995).
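The McNemar calculations of Example 13.14 depend only on the discordant cells f and g of Figure 13.10, and can be sketched as follows. The variable names are illustrative, and the exact p-value is computed here under the assumption that it is the usual two-sided binomial probability (doubling the smaller tail of a Binomial(f + g, 0.5) distribution); this is not necessarily the algorithm the statistical package uses internally.

```python
from math import comb, sqrt
from statistics import NormalDist

f, g = 40, 77        # discordant pairs from Figure 13.10
n_pairs = 352        # all matched pairs

# Large-sample McNemar test, as in the worked example
z = (f - g) / sqrt(f + g)
chi2 = z ** 2
p_approx = 2 * (1 - NormalDist().cdf(abs(z)))

# Exact version (assumed form): under the null hypothesis f ~ Binomial(f + g, 0.5)
n_disc = f + g
lower_tail = sum(comb(n_disc, i) for i in range(min(f, g) + 1)) / 2 ** n_disc
p_exact = min(1.0, 2 * lower_tail)

odds_ratio = f / g               # estimate of the odds ratio
diff = (f - g) / n_pairs         # difference in proportions with caries (RFS minus GIC)

print(round(z, 2), round(chi2, 2), round(p_approx, 4))
print(round(odds_ratio, 4), round(diff, 4), p_exact)
```

Because the concordant cells e and h do not enter the test statistic, only the 117 discordant pairs carry information about the treatment difference.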


13.3.4 Trial size 13.3.4.1 Continuous outcome To estimate the size of a split-mouth design with a continuous endpoint use can be made of expressions (13.6a) or (13.6b) given for the matched-pair situation. However, there are situations in which more than one site is available in each of the two segments of the oral cavity concerned. Each of these k > 1 sites from one segment will be randomised to receive S and those k > 1 sites in the other segment will receive T. To account for the k sites per segment, Zhu, Zhang and Ahn (2017) provide a modification to Equation (13.6a) to give the corresponding sample size when k ≥ 2 as

$$N_{Pairs} = \frac{1-\rho}{k}\left[\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\left(\delta_{Plan}/\sigma_{DiffPlan}\right)^2} + \frac{z_{1-\alpha/2}^2}{2}\right], \quad k \ge 2. \qquad (13.12)$$

In the more usual situation when k = 1, expressions (13.6a) or (13.6b) will be used.

Example 13.15

Split-mouth design – restorative dental treatment

Lo, Luo, Fan and Wei (2001) of Example 13.13 report the net occlusive wear 2 years after applying ChemFlex and Fuji IX Gp to permanent teeth in 53 school children as 75 and 79 μm respectively with corresponding SDs of 23 and 20 μm. If the trial were to be repeated, how many children would need to be randomised? The SDs quoted are for between subjects so that, as they are based on equal numbers of children, the averaged value is $SD_{Between} = \sqrt{(23^2 + 20^2)/2} = 21.55$. Using this as the value for planning we have, using Equation (13.5), $\sigma_{DiffPlan} = 21.55 \times \sqrt{2(1 - \rho_{Plan})} = 30.48\sqrt{1 - \rho_{Plan}}$. Further, assuming δPlan = (79 − 75) = 4 μm, then the anticipated standardised effect size is

$$\Delta_{Plan} = \frac{4}{30.48\sqrt{1 - \rho_{Plan}}} = \frac{0.1312}{\sqrt{1 - \rho_{Plan}}}.$$

This provides a wide range of possibilities for the eventual sample size depending on the value of ρPlan chosen. Without knowledge of a specific value for ρPlan a range of values may be investigated. Using the values that Hujoel and Loesche (1990) suggest for caries prevention trials, the investigators choose values for ρWithin of −0.17, −0.1, 0 and +0.02. The corresponding values for ΔPlan are 0.1213, 0.1251, 0.1312 and 0.1326 which are all very small by the Cohen (1988) criteria. Assuming a 2-sided test size of 5% and a power of 80%, then using Table T3 and Equation (13.6a) suggests for the largest ΔPlan = 0.1326,

$$N_{Mouths} = \frac{(1.96 + 0.8416)^2}{0.1326^2} + \frac{1.96^2}{2} = 448.32$$

or approximately 450 children with suitable target teeth. For the smaller values of ΔPlan of 0.1213, 0.1251 and 0.1312 the corresponding values of NMouths are 536, 504 and 458 respectively. This range of values for the possible trial size illustrates that an appropriate choice for the value of ρPlan is critical. Had the design included k = 2 target teeth per segment, then for ρPlan = +0.02 the multiplier introduced into Equation (13.12) would be (1 − ρPlan)/k = (1 − 0.02)/2 = 0.49 and this would reduce the above situation of NMouths = 450 to 0.49 × 450 = 220.5 or approximately 220 mouths.
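The sensitivity of the trial size in Example 13.15 to the planning correlation can be explored with a small function. This is a sketch under stated assumptions: `split_mouth_size` is an illustrative name, the formulas are those implied by the worked numbers for Equations (13.5) and (13.6a), and the multiplier (1 − ρ)/k is applied to the whole sample size when k ≥ 2, as the example does. Exact quantiles and an unrounded ΔPlan are used, so results can differ from the text in the last digit or two.

```python
from math import ceil, sqrt
from statistics import NormalDist


def split_mouth_size(delta, sd_between, rho, k=1, alpha=0.05, power=0.80):
    """Return (standardised effect size, number of mouths) for a continuous endpoint."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha / 2), z(power)
    sd_diff = sd_between * sqrt(2 * (1 - rho))   # SD of the within-mouth difference
    effect = delta / sd_diff
    n = (z_alpha + z_beta) ** 2 / effect ** 2 + z_alpha ** 2 / 2
    if k >= 2:
        n *= (1 - rho) / k                       # multiplier for k sites per segment
    return effect, n


sd_between = sqrt((23 ** 2 + 20 ** 2) / 2)       # averaged SD from the two groups
results = {rho: split_mouth_size(4, sd_between, rho) for rho in (-0.17, -0.1, 0.0, 0.02)}

for rho, (effect, n) in results.items():
    print(rho, round(effect, 4), ceil(n))

# Two target teeth per segment with rho = 0.02
_, n_k2 = split_mouth_size(4, sd_between, 0.02, k=2)
print(ceil(n_k2))
```

Running the function over a grid of plausible ρPlan values before fixing the protocol makes the dependence on this single planning assumption explicit.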

13.3.4.2 Binary outcome To estimate the size of a split-mouth design clinical trial with a binary endpoint use can be made of expressions (13.7), (13.8) and (13.9) given for the matched-pair situation. However, in situations in which more than one site is available in each of the two segments of the oral cavity concerned, then the sample size is modified using the multiplier (1 − ρPlan)/k as suggested by Zhu, Zhang and Ahn (2017, p. 2547).

Example 13.16

Split-mouth design – Caries prevention

Suppose we wished to replicate the study of Arrow and Riordan (1995) in another geographical location but using a randomised controlled trial design. The new investigators use the information from the earlier study summarised in Figure 13.10 for planning purposes. That study estimates π GIC = 266/352 = 0.76 and π RFS = 303/352 = 0.86 to be caries-free. The correlation can be estimated from the information within the four cells to give ρB = 0.6. Hence, with ρB = 0.6 Equation (13.7) gives planning values p10 = 0.1175 and p01 = 0.0175, from which Equation (13.8) gives pDiff = −0.1 and pDisc = 0.1350. Assuming a 2-sided test size, α = 0.05, and a power of 1 − β = 0.90, Table T3 gives z0.975 = 1.96 and z0.90 = 1.2816. Finally, Equation (13.9) gives

$$N_{Mouths} = \frac{\left[1.96\sqrt{0.1350} + 1.2816\sqrt{0.1350 - (-0.1)^2}\right]^2}{(-0.1)^2} = 137.62$$

or approximately 140 with 70 mouths randomly assigned to each sequence GIC–RFS and RFS–GIC. When ρB = 0.3 this increases to NMouths = 232. As ρB decreases from 0.6 to zero, the sample size required would increase to NMouths = 326. A pragmatic choice in this situation may be to take a sample size between these two extremes, for example, when ρB = 0.4, NMouths = 200. Had the design included k = 2 target teeth per segment, then for ρB = 0.3 the multiplier introduced into Equation (13.12) would be (1 − ρPlan)/k = (1 − 0.3)/2 = 0.35 and this would reduce the sample size to NMouths = 0.35 × 232 = 81.2 or about 80.
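The planning calculation of Example 13.16 can be sketched in code. Equations (13.7) to (13.9) are not reproduced in this section, so the forms below are assumptions: the joint caries-free probability is taken as π1π2 + ρ√[π1(1 − π1)π2(1 − π2)], which reproduces the discordant probabilities 0.0175 and 0.1175 quoted in the example, and the sample size formula is the one implied by the worked numbers. The function and variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def binary_split_mouth_size(pi_1, pi_2, rho, alpha=0.05, power=0.90):
    """Return (p_disc, p_diff, number of mouths) for a paired binary endpoint."""
    z = NormalDist().inv_cdf
    # Assumed form of Equation (13.7): probability of success at both sites
    p_both = pi_1 * pi_2 + rho * sqrt(pi_1 * (1 - pi_1) * pi_2 * (1 - pi_2))
    p_only_1 = pi_1 - p_both          # success with treatment 1 only
    p_only_2 = pi_2 - p_both          # success with treatment 2 only
    p_disc = p_only_1 + p_only_2      # proportion of discordant pairs
    p_diff = p_only_1 - p_only_2      # difference in success proportions
    numerator = (z(1 - alpha / 2) * sqrt(p_disc)
                 + z(power) * sqrt(p_disc - p_diff ** 2)) ** 2
    return p_disc, p_diff, numerator / p_diff ** 2


# Planning values of Example 13.16: caries-free proportions for GIC and RFS
sizes = {rho: binary_split_mouth_size(0.76, 0.86, rho) for rho in (0.6, 0.4, 0.3)}

for rho, (p_disc, p_diff, n) in sizes.items():
    print(rho, round(p_disc, 4), round(p_diff, 4), round(n, 2), ceil(n))
```

The proportion of discordant pairs grows as ρB falls, which is what drives the larger trial sizes at the lower correlations.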

13.4 Guidelines

Specific CONSORT guidelines to assist in the reporting (and hence planning) of cross-over trials have been given by: Dwan K, Li T, Altman DG and Elbourne D (2019). CONSORT 2010 statement: extension to randomised crossover trials. BMJ, 366, l4378.

CHAPTER 14

Repeated Measures Design

This chapter describes randomised parallel-group designs with repeated measures of the main outcome assessed over time in the individuals recruited to the trial. These longitudinal designs may include repeated pre-intervention as well as repeated post-randomisation assessments over time. As a consequence, repeated measures on the same participant are not independent so these designs are analysed using methods which take account of the inherent correlations (termed auto-correlation) between successive assessments. The corresponding regression models, which may involve both fixed- and random-effects terms, are described. Also included are methods for determining the number of subjects to recruit to such trials. Suggestions for reporting longitudinal trials are included and ideas for graphical presentation made. In addition, trials that include repeated measures on the same trial participant but are not longitudinal in nature are also discussed.

14.1

Introduction

So far in this book, we have based our discussions around the parallel 2-group design in which individual subjects are randomised to the interventions concerned. In particular, in these trials, information collated from the allocated subjects is regarded as independent from one subject to another. One adaptation of this basic design is to keep the original structure, but to use as an outcome a feature that can be repeatedly measured on the individuals over a time interval starting immediately post-randomisation. In this case, successive observations are made on each participant that cannot be regarded as independent of each other. This in-built correlation between successive assessments is termed serial- or auto-correlation and has to be taken note of in determining the design options for the trial, the appropriate sample size, and the eventual analysis. As an example, if the outcome variable in a clinical trial of a new antihypertensive agent is the systolic blood pressure (SBP) then this may be ascertained at one fixed time-point post-randomisation, say at week-12 once active treatment is complete. Alternatively, SBP might also be assessed additionally at several stages during the active treatment period as well as at intervals after the intervention is complete. Similarly, repeat observations of SBP might also be included at times preceding the eventual randomisation.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.


Thus, design options for the investigator include one or more repeat observations before randomisation, post-randomisation during the active treatment, and in a follow-up period once active treatment is complete. Typically, these trial designs include v ≥ 0 baseline or other pre-randomisation observations and w > 1 post-randomisation from each subject recruited. In certain circumstances, a repeated measures design can ensure a more efficient comparison between the interventions on test and thereby may result in a reduction in the number of participants that need to be recruited to the associated clinical trial. There are many options for this basic design dependent on the anticipated data profile following each of the interventions under test.

Example 14.1 Heart failure In the trial conducted by Yeo, Yeo, Hadi, et al. (2018), patients with heart failure were asked to complete a 6-minute walk test (6MWT) immediately before (time zero) then at 4- and 12-weeks post-randomisation to a single intravenous dose of either Ferric Carboxymaltose (FCM) or 0.9% Saline regarded as a Placebo (P). The summary profiles of Figure 14.1 suggest an initial increase in the mean 6MWT in both treatment groups over baseline values and thereafter suggest little change beyond week-4.

[Plot: mean 6-minute walk (m) against time from randomisation (weeks 0 to 12) for 0.9% Saline (Placebo) and Ferric Carboxymaltose (FCM)]

Figure 14.1 Mean 6-minute walk test (6MWT) values at baseline, 4- and 12-week post-randomisation to Placebo or FCM in patients with heart failure. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

Example 14.2 Moderate-to-severe eczema

As we introduced in Example 1.7, Meggitt, Gray and Reynolds (2006) randomised patients with moderate-to-severe eczema to receive either Azathioprine (A) or Placebo (P) in a double-blind formulation to ascertain the relative reduction in disease activity as assessed by the six-area six-sign atopic dermatitis (SASSAD) score between the treatment groups. In fact, SASSAD was measured on several occasions including 2 weeks before, at baseline (time zero) immediately before randomisation, and then post-randomisation at 4, 8 and 12 weeks. For this design, v = 2 and w = 3. They compared the groups using the respective average regression slopes which indicated a 12.0-unit reduction with A and a 6.6 reduction with P over the 12-week period.

Example 14.3 Adult patients with atopic eczema The results of the randomised trial in patients with adult eczema conducted by Reynolds, Franklin, Gray, et al. (2001) indicated non-linear profiles for the reduction in body surface area (BSA). The summary profiles of the three phototherapy intervention groups being compared are shown in Figure 14.2 and reflect a rather uncertain pattern over time.

[Plot: %BSA reduction against time (weeks 0, 3, 6, 9, 12 and 24) for Narrowband UVB, Broadband UVA and Visible Fluorescent Light]

Figure 14.2 Percentage reduction in body surface area (BSA) over time in patients with adult atopic eczema randomised to visible fluorescent light, broadband ultraviolet A (UVA) or narrowband ultraviolet B (UVB). Source: Reynolds, Franklin, Gray et al. (2001). © Elsevier


However, as the individual observations are expressed as a percentage of the individual baseline values in Figure 14.2, this presentation distorts the pattern of changing BSA and may give a false impression of the changing scenarios over time. Neither does it give an indication of the baseline variation in BSA. Plotting the mean values of BSA itself within the intervention groups over time is likely to give the reader a better understanding of the underlying profile pre- and post-initiation of therapy. Further, this design may be considered as either (v = 0 and w = 6) or (v = 1 and w = 5) depending on the format of the analysis with respect to whether or not the baseline value is incorporated into the summary profile. If the design includes v = 1, then the baseline is not considered as part of the post-randomisation profile; otherwise if v = 0, it would be regarded as part of the whole profile. Whichever is the case it should be specified in the corresponding trial protocol.

14.2 Simplified analysis

In some situations, although the design may be longitudinal in nature, the analysis of the primary end point chosen for the trial may obviate the need to consider autocorrelation between successive measures.

14.2.1 Change from baseline 14.2.1.1 One baseline – one post-randomisation assessment If the ultimate aim of a longitudinal investigation is, for example, to look at the long-term benefit when comparing the Standard (S) and Test (T) interventions then the investigators may consider the final, or any other single specific, post-intervention measurement as the main endpoint. In this situation, there is a single trial end point despite earlier, and possibly later, assessments of the same outcome measure. Thus, although the trial is longitudinal in nature and therefore w > 1, the eventual analysis takes no account of that fact. In such situations, the trial protocol must clearly define the component end point as, for example, ‘body surface area (BSA) at 24 weeks’ and further that the change from baseline BSA is to be used for evaluation purposes. This corresponds to the simplest repeated measures design in which there is a single pre-randomisation (baseline) and a single post-randomisation assessment made from each subject recruited; hence, the design is designated as (v = w = 1) with observations for each subject at baseline, y0i, and at a chosen time t, yti. In this situation, there are two options for the associated analysis. The first essentially regards the change from baseline


$$d_i = y_{ti} - y_{0i}, \qquad (14.1)$$

as the basic measure for each subject, from which the means, $\bar{d}_S$ and $\bar{d}_T$, of those in each intervention group are obtained. Their difference of differences is calculated as

$$D = \bar{d}_T - \bar{d}_S. \qquad (14.2)$$

The second compares the relative change or ratio (often expressed as a percentage as in Figure 14.2)

$$r_i = \frac{y_{ti} - y_{0i}}{y_{0i}} = \frac{y_{ti}}{y_{0i}} - 1, \qquad (14.3)$$

as the basic measure for each subject, from which the means, $\bar{r}_S$ and $\bar{r}_T$, of those in each intervention group are obtained. Their difference of ratios is calculated as

$$R = \bar{r}_T - \bar{r}_S. \qquad (14.4)$$

Although the yti and y0i observations are likely to be auto-correlated as they are both obtained from the same subject, the N values of di or ri are each obtained from different subjects and so are not correlated. Hence, they are considered independent and so a t-test, or a regression command, is then used for analysis. If this setting is formulated in model terms, then

$$d_i \text{ or } r_i = \beta_0 + \beta_{Treat}\,x_i + \varepsilon_i, \qquad (14.5)$$

where xi = 0 when the patient receives S and xi = 1 when T is allocated.

Example 14.4 Heart failure The trial conducted by Yeo, Yeo, Hadi, et al. (2018) of Example 14.1 in heart failure patients had a v = 1, w = 2 design for the number of repeat 6MWTs to be conducted and which we denote by Walk0, Walk4 and Walk12.

Assume the objective was to compare the change observed in each patient, d = (Walk12 − Walk0), and ultimately calculate D of Equation (14.2). In this situation, the same results are obtained using either the Student’s t-test or linear regression approaches. The associated commands and edited output using these are given in Figure 14.3. These both estimate D = 5.6 m with 95% CI −35.2 to +46.5 m and p-value = 0.78. Neither suggests a convincing advantage of FCM over P.

Student’s t-test

ttest D, by(Treat)

Two-sample t test with equal variances
-----------------------------------------------------
Group    | Obs    Mean      SD       (95% CI)
---------+-------------------------------------------
Placebo  |  23   59.91    67.96
FCM      |  21   65.52    66.07
---------+-------------------------------------------
d        |  44    5.61    20.24   -35.24 to 46.46
-----------------------------------------------------
t = 0.28, degrees of freedom = 42, p-value = 0.78

Regression model

regress D Treat

Number of obs = 44
----------+------------------------------------
Source    |        SS    df        MS      F
----------+------------------------------------
Model     |    345.57     1    345.57   0.08
Residual  | 188915.06    42   4497.98
----------+------------------------------------
Total     | 189260.64    43
----------+------------------------------------

----------+------------------------------------
D         |   Coef      SE       (95% CI)
----------+------------------------------------
cons      |  59.91   13.98
Treat     |   5.61   20.24   -35.24 to 46.46
----------+------------------------------------
t = 0.28, p-value = 0.78

Figure 14.3 Comparison of 6-minute walk test between Placebo and FCM using the change from baseline to week 12 post-randomisation (a difference of differences) as the outcome measure in patients with heart failure. Source: Data from Yeo, Yeo, Hadi, et al. (2018).
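The entries of Figure 14.3 can be reproduced from the group summary statistics alone, which also shows why the t-test and the regression on a 0/1 treatment indicator agree: the regression coefficient for Treat is the difference in group means and its standard error uses the same pooled variance. The sketch below uses illustrative variable names, and the critical value 2.018 for Student's t with 42 degrees of freedom is taken from standard tables rather than computed.

```python
from math import sqrt

# Summary statistics for the change d = Walk12 - Walk0 (Figure 14.3)
n_p, mean_p, sd_p = 23, 59.91, 67.96   # Placebo
n_f, mean_f, sd_f = 21, 65.52, 66.07   # FCM

df = n_p + n_f - 2
# Pooled variance, equal to the residual mean square of the regression
pooled_var = ((n_p - 1) * sd_p ** 2 + (n_f - 1) * sd_f ** 2) / df
se = sqrt(pooled_var * (1 / n_p + 1 / n_f))

D = mean_f - mean_p                    # difference of differences, Equation (14.2)
t = D / se

t_crit = 2.018                         # assumed tabulated value: t(0.975) with 42 df
ci = (D - t_crit * se, D + t_crit * se)

print(round(D, 2), round(se, 2), round(t, 2), tuple(round(x, 2) for x in ci))
```

The pooled variance obtained here is close to the residual mean square of 4497.98 in the regression output, the small discrepancy arising from the rounding of the published SDs.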


14.2.1.2 Mean baseline – mean post-randomisation assessments However, the approaches of Equations (14.1) and (14.3) to the analysis would be similar for a design with v > 1 and/or w ≥ 2 with any multiple observations in the pre- and post-randomisation intervals summarised by their mean, in which case

$$d_i = \bar{y}_{Post,i} - \bar{y}_{Pre,i}, \qquad (14.6)$$

with an equivalent expression for ri. The comparison of two interventions S and T would then be made using the means of the respective di from the two groups. This method can be further modified to other comparison types such as when post-randomisation observations are taken on w occasions but only the mean value of a selection of these is compared to (say) a single baseline assessment. For example, suppose the baseline value y0 and two of the w > 2 post-randomisation values (say) y2 and y5 are utilised, the latter with mean (y2 + y5)/2. In this case, the end point of interest will be regarded as having weights W0 = −1, W2 = 0.5 and W5 = 0.5 to define the contrast, C = −1 × y0 + 0.5 × y2 + 0.5 × y5, which is then calculated for each subject to become the end point measure for analysis. The mean of these values for S and T are calculated and compared. In general, the contrast for each individual is defined by

$$C = \sum_k W_k\,y_k, \qquad (14.7)$$

where the values taken for k will depend on the weights chosen from the v + w observation points concerned. If a particular observation (say) yt is not included, then effectively Wt = 0 in C. Example 14.2 referred to the average regression analysis conducted by Meggitt, Gray and Reynolds (2006). That trial had equally spaced repeated measures y0, y4, y8 and y12 at weeks 0, 4, 8 and 12. If the corresponding times are recoded as 0, 1, 2 and 3, then their mean value is 3/2. Consequently, the regression analysis is essentially equivalent to using Equation (14.7) in the form C = (0 − 3/2)y0 + (1 − 3/2)y4 + (2 − 3/2)y8 + (3 − 3/2)y12 = −1.5y0 − 0.5y4 + 0.5y8 + 1.5y12.

14.2.1.3 Adjusting for baseline The analysis which models a single yi directly but taking due account of the baseline assessment is formulated by

$$y_i = \beta_0 + \beta_{Treat}\,x_i + \gamma_0\,y_{0i} + \varepsilon_i. \qquad (14.8)$$
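Returning briefly to the contrast of Equation (14.7), the link between the centred-time weights and a patient's least-squares slope can be made concrete. The scores below are hypothetical values for a single patient, chosen only for illustration and not taken from any trial; the weights are the recoded times 0, 1, 2 and 3 minus their mean.

```python
# Hypothetical scores for one patient at weeks 0, 4, 8 and 12 (illustrative values only)
y = [40.0, 35.0, 31.0, 28.0]
times = [0, 1, 2, 3]                      # weeks recoded in units of 4 weeks

t_bar = sum(times) / len(times)           # mean of the recoded times
weights = [t - t_bar for t in times]      # centred times used as contrast weights

# Contrast of Equation (14.7) with the centred times as weights
contrast = sum(w * yk for w, yk in zip(weights, y))

# Least-squares slope of y on time: the contrast divided by the sum of squared weights
slope = contrast / sum(w ** 2 for w in weights)

print(weights, contrast, slope)
```

Since the divisor is the same constant for every patient with complete data at the four visits, comparing mean contrasts between groups is equivalent to comparing mean slopes.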

Example 14.5 Heart failure

If the main focus for the trial of Example 14.1 is on the difference between interventions at week-12, then Walk12 can be modelled directly by setting γ0 = 0 in Equation (14.8). Figure 14.4 gives bTreat = 33.49 m with a wide 95% CI −31.4 to +98.4 m and p-value = 0.30. However, if account is taken of the baseline, Walk0, then using the full format of Equation (14.8), the regression model adjusting for baseline gives bTreat = 9.08 m and cWalk0 = 0.88. The estimated difference between FCM and Placebo of 9.08 m is now much smaller than in the previous analysis without the covariate, suggesting the possibility of an even smaller true difference between the interventions. The 95% CI −32.1 to +50.3 m is narrower but remains wide, with an increased p-value = 0.66. In this example, Walk0 plays a very important role since the corresponding t = 0.88/0.11 = 8.10 implies an associated p-value < 0.0001. This is a consequence of the very high correlation between Walk12 and Walk0 of ρ = 0.79.

Regression model taking no account of baseline

regress Walk12 Treat

----------+------------------------------------
Source    |        SS    df         MS      F
----------+------------------------------------
Model     |  12315.44     1   12315.44   1.09
Residual  | 476154.11    42   11337.00
----------+------------------------------------
Total     | 488469.55    43
----------+------------------------------------

----------+------------------------------------
Walk12    |   Coef      SE       (95% CI)
----------+------------------------------------
cons      | 300.70   22.20
Treat     |  33.49   32.14   -31.36 to 98.35
----------+------------------------------------
t = 1.04, p-value = 0.30

Regression model adjusting for baseline

regress Walk12 Treat Walk0

----------+------------------------------------
Walk12    |   Coef      SE       (95% CI)
----------+------------------------------------
cons      |  89.84   29.51
Treat     |   9.08   20.39   -32.10 to 50.25
Walk0     |   0.88    0.11
----------+------------------------------------
Treat: t = 9.08/20.39 = 0.45, p-value = 0.66
Baseline: t = 0.88/0.11 = 7.03, p-value < 0.0001

Figure 14.4 Comparison of 6-minute walk test at week 12 post-randomisation to Placebo or FCM in patients with heart failure (a) without and (b) with adjustment for baseline. Source: Data from Yeo, Yeo, Hadi, et al. (2018).


In this situation, the baseline assessment y0i is regarded as a covariate with associated regression coefficient γ 0 whose estimate we denote by c0 to emphasise its covariate status. A similar model can be used when w = 1 but there are v > 1 pre-intervention assessments in which case y0i is replaced by the mean of the pre-intervention measures from each patient, y0i . Note also that should γ 0 = 0 then model (14.8) becomes Equation (14.5). Although the analysis using Equation (14.2) in Figure 14.3 and (14.8) in Figure 14.4 clearly differ, both utilise exactly the same patient data, and are clearly not statistically significant. Nevertheless, the latter which estimates the ‘difference’ between groups, rather than the ‘difference of differences,’ is much easier to interpret. The same could be said if the difference between the groups was summarised by a difference of ratios arising from the format of Equations (14.3) and (14.4). For some situations, the intervention takes immediate effect and, for example, quickly lowers the measure of concern to a level which is maintained thereafter. The aim of the trial would be to compare S and T to see whether the corresponding levels differ to a clinically important extent. In this situation, the mean value could be calculated for each patient and then used as the end point measure. In this case, and when a single baseline measure is included, the design is v = 1, w > 1 and the corresponding model is yPost,i = β0 + βTreat xi + γ 0 yPre,i + εi

(14.9)

If multiple pre-randomisation values are also available, then the means from those observations can be used in the regression analysis of the post-randomisation means. In this case, the design has v > 1 and w > 1, and Equation (14.9) is extended to become

$$\bar{y}_{\mathrm{Post},i} = \beta_0 + \beta_{\mathrm{Treat}} x_i + \gamma_0 \bar{y}_{\mathrm{Pre},i} + \varepsilon_i \tag{14.10}$$

However, use of models such as (14.10), which do not deal with the individual observations (only their means), may lead to some loss of information and so may not describe the differences between the interventions concerned adequately.
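The summary-measures approach of Equation (14.10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name summary_measure_ancova, the simulated values and the use of NumPy least squares are ours, and the data are hypothetical rather than those of any trial discussed here.

```python
import numpy as np

def summary_measure_ancova(pre, post, treat):
    """Least-squares fit of ybar_post = b0 + bTreat * x + c0 * ybar_pre.

    pre   : (N, v) array of pre-randomisation values
    post  : (N, w) array of post-randomisation values
    treat : (N,) array of 0/1 allocation indicators
    Returns the estimates (b0, bTreat, c0).
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    ybar_pre = pre.mean(axis=1)      # one pre-randomisation mean per subject
    ybar_post = post.mean(axis=1)    # one post-randomisation mean per subject
    X = np.column_stack([np.ones(len(ybar_pre)), treat, ybar_pre])
    coef, *_ = np.linalg.lstsq(X, ybar_post, rcond=None)
    return coef

# Hypothetical data (NOT trial data): v = w = 4, true treatment effect -250.
rng = np.random.default_rng(1)
N, v, w = 200, 4, 4
treat = np.repeat([0, 1], N // 2)
level = rng.normal(850, 100, size=N)                  # subject-specific level
pre = level[:, None] + rng.normal(0, 50, size=(N, v))
post = level[:, None] - 250 * treat[:, None] + rng.normal(0, 50, size=(N, w))

b0, b_treat, c0 = summary_measure_ancova(pre, post, treat)
print(f"bTreat = {b_treat:.1f}, c0 = {c0:.3f}")
```

Because each subject contributes a single pair of means, no auto-correlation structure needs to be specified, which is the attraction of this simplified analysis.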

Example 14.6 Patients with phenylketonuria

Levy, Milanowski, Chakrapani, et al. (2007) compared sapropterin dihydrochloride (S) with (double-blind) placebo (P) in patients with phenylketonuria to assess its role in reducing blood phenylalanine concentration and therefore its potential for preventing mental retardation in these patients. Their design consisted of four pre-treatment initiation and four post-randomisation measures, taken at Screen and at weeks −2, −1, 0 (baseline), 1, 2, 4 and 6, hence v = w = 4. An illustration of their results is given in Figure 14.5, which suggests that post-randomisation blood phenylalanine levels remain unchanged at about 850 μmol/L in those randomised to receive P. For those receiving S, there is an immediate drop by Week 1 to a level of approximately 600 μmol/L, which is then maintained thereafter. Although the post-randomisation individual profiles of the patients are not illustrated in Figure 14.5, it seems reasonable to assume they may be fluctuating about specific levels in each of the P and S groups. Therefore, summarising the four post-randomisation observations from each subject by their mean would appear sensible. In practice, one may wish to first view the individual profiles before concluding that this is indeed reasonable (see the cautionary note in Section 14.8, however). The subsequent analysis then takes the same format as that of Figure 14.4 except that Walk12 and Walk0 are replaced by the post- and pre-randomisation means, as in Equation (14.10).

[Plot: mean blood phenylalanine (μmol/L; vertical axis from 0 to 1200) against time from randomisation (weeks; Screen, −2, −1, 0, 1, 2, 4, 6) for Placebo (n = 47) and Sapropterin (n = 41), with the pre-randomisation period marked.]

Figure 14.5 Mean blood phenylalanine concentration levels pre-randomisation and post-randomisation, after being assigned to Placebo (P) or Sapropterin (S), in patients with phenylketonuria. Source: Levy, Milanowski, Chakrapani, et al. (2007) © Elsevier.

14.3 Regression models

We discuss below some specific repeated measures design options but, whatever the design chosen, the subsequent analysis will be conducted using a regression model of some type. This section also describes some general features of models that may be used.

14.3.1 Fixed effects

The simple linear regression model of Equations (2.1) and (8.13) has two parameters, β0 and βTreat, which are to be estimated from the data arising from the associated clinical trial. The associated calculation assumes that all patients have the same underlying values of β0 and βTreat. In this situation, the model is termed ‘fixed-effect.’

14.3.2 Trends over time

In a repeated-measures trial, the basic end point measure is assessed on more than a single occasion and at times specified by the chosen design. Thus, the two-group parallel comparative trial comparing S and T interventions may involve w post-randomisation observations of the same measure from each subject recruited to the trial. The object of T, once initiated, is to cause a change in these values as compared to those from S, perhaps to lower them by a fixed quantity. In this case, with some notational changes, the simple model of Equation (2.1) for continuous data is extended to

$$y_{it} = \beta_0 + \beta_{\mathrm{Treat}} x_i + \beta_{\mathrm{Time}} t + \eta_{it}, \tag{14.11}$$

where there are i = 1, 2, …, N = nS + nT subjects and there are w repeat observations at time-points t = t1, t2, …, tw. Randomisation occurs at time t = 0, so nT individuals are allocated T, which implies xi = 1. Similarly, xi = 0 for the nS individuals assigned to S. This model implies that prior to randomisation, all subjects will commence at the same (fixed) baseline with a fixed population value of β0, while post-randomisation the two groups then differ by βTreat in their population means although the end point measure changes linearly with time depending on the value of βTime. If time has no effect then βTime = 0 and the end point measures fluctuate about a fixed value over time at a level depending on the assigned intervention. In Equation (2.1), we stated that ε represents the noise (or error) and this is assumed to be random and have a mean value of 0 across all subjects recruited to the trial, and to have a standard deviation (SD), σ. However, in the repeated measures Equation (14.11), ε is replaced by η because this ‘error’ term now contains both variation between different participants in the trial as well as variation resulting from measures taken within each of the individuals concerned.

14.3.3 Random effects

An extension of Equation (14.11) is to assume that each recruited individual i has a regression line with common slope βTime but with individual intercept values β∗0,i. Assuming no pre-randomisation assessments, as in Equation (14.11), the model becomes

$$y_{it} = \beta^{*}_{0,i} + \beta_{\mathrm{Treat}} x_i + \beta_{\mathrm{Time}} t + \eta^{*}_{it} \tag{14.12}$$

This is known as a mixed model as it has fixed components, such as βTreat and βTime, and a (so-called) random effects component, β∗0,i. Here, the intercepts are assumed to follow an independent Normal distribution with mean β∗0 and variance σ²0. As a consequence, the random error term is now denoted η∗it as it now contains a component attributable to the variation in the intercept as well as within and between subject variation.

Another possibility is to extend Equation (14.12) to assume that each recruited individual i also has their own regression slope, β∗Time,i, in which case the mixed model becomes

$$y_{it} = \beta^{*}_{0,i} + \beta_{\mathrm{Treat}} x_i + \beta^{*}_{\mathrm{Time},i}\, t + \eta^{**}_{it} \tag{14.13}$$

Here, the slopes are assumed to follow an independent Normal distribution with mean β∗Time and variance σ²Time. As a consequence, the random error term is denoted η∗∗it as it now contains a component attributable to the variation in the intercept and slope added to within and between subject variation. However, the main focus of the statistical analysis remains as before, and that is to compare the intervention groups. That is, to estimate the fixed effect βTreat and the corresponding CI, and to test the null hypothesis βTreat = 0.

As a repeated measures design can also include a pre-randomisation or baseline observation, y0i, the model can account for this by extending the right-hand side of Equation (14.13) to become

$$y_{it} = \beta^{*}_{0,i} + \beta_{\mathrm{Treat}} x_i + \beta^{*}_{\mathrm{Time},i}\, t + \gamma_0 y_{0i} + \eta^{**}_{it}, \tag{14.14}$$

where γ0 is the regression coefficient corresponding to the baseline measure. Essentially this model now implies that the value of a (post-randomisation) observation y of an individual depends on the intervention received, the time when the observation was made, and also the initial value of that observation before the allocated intervention is initiated.

The above models can be extended to also allow for pre-baseline measures as covariates. However, as the (final) baseline measure would be taken just before t = 0, the earlier baseline values will have an associated t which is negative. For example, in the trial of Example 14.2, the patients with moderate-to-severe eczema had one assessment 2 weeks prior to baseline (time −2) and one at baseline (time 0), hence v = 2. Further, there are three post-randomisation measures at 4, 8 and 12 weeks, so w = 3. In this example, the model takes the form

$$y_{it} = \beta^{*}_{0,i} + \beta_{\mathrm{Treat}} x_i + \beta^{*}_{\mathrm{Time},i}\, t + \gamma_{-2}\, y_{-2,i} + \gamma_0 y_{0i} + \eta^{**}_{it} \tag{14.15}$$

However, in many instances, if there are several baseline measures, they may be summarised by their mean value for each individual at the analysis stage of the trial.
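To make the random-intercept model of Equation (14.12) concrete, the following Python sketch simulates data from it. All names and parameter values are hypothetical choices of ours; in particular sd_within, the SD of the purely within-subject noise, is an assumption not named in the text. The sketch also illustrates a standard consequence of the model: within a treatment group, any two observations on the same subject have correlation σ²0/(σ²0 + σ²within).

```python
import numpy as np

def simulate_random_intercept(n_subjects, times, beta0, beta_treat, beta_time,
                              sd_intercept, sd_within, seed=0):
    """Simulate y_it = b0_i + beta_treat * x_i + beta_time * t + noise.

    b0_i is Normal(beta0, sd_intercept); the noise is Normal(0, sd_within).
    Returns x (allocation, 0/1) and y (n_subjects x len(times)).
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    x = np.arange(n_subjects) % 2                      # alternate S (0) and T (1)
    intercepts = rng.normal(beta0, sd_intercept, size=n_subjects)
    noise = rng.normal(0.0, sd_within, size=(n_subjects, len(times)))
    y = (intercepts[:, None] + beta_treat * x[:, None]
         + beta_time * times[None, :] + noise)
    return x, y

# Hypothetical values, chosen only for illustration.
x, y = simulate_random_intercept(4000, [4, 8, 12], beta0=260, beta_treat=10,
                                 beta_time=4.5, sd_intercept=60, sd_within=30,
                                 seed=1)
corr_S = np.corrcoef(y[x == 0], rowvar=False)          # correlations within group S
icc = 60**2 / (60**2 + 30**2)                          # implied correlation
print(np.round(corr_S, 2), round(icc, 2))
```

The off-diagonal entries of corr_S should all lie close to the implied value, whichever pair of time-points is compared.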

14.4 Auto-correlation

In describing the models of the previous section, we have not discussed the auto-correlation implicit in longitudinal designs. However, as we have indicated when describing model (14.11), the error term, ηit, contains components of both between individuals and within an individual variation. Although the between subjects component of this variation will be independent from subject to subject, the within component is not likely to be independent as successive observations from the same individual are likely to be correlated to some extent. Thus, when fitting the relevant statistical model for the chosen design some account of this correlation has to be made.

14.4.1 Auto-correlation coefficient

The usual measure of association between two variables, say height (h) and weight (w), is the Pearson correlation coefficient. In assessing the strength of the auto-correlation between measures taken of the same variable repeated at two different times in the same individual, the Pearson correlation is also used but with y1 recorded at t1 replacing h and y2 recorded at t2 replacing w. It is then estimated by

$$\rho(y_1, y_2) = \frac{\sum (y_1 - \bar{y}_1)(y_2 - \bar{y}_2)}{\sqrt{\sum (y_1 - \bar{y}_1)^2 \sum (y_2 - \bar{y}_2)^2}} \tag{14.16}$$
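As a check on the arithmetic of Equation (14.16), a minimal Python sketch follows. The function name auto_correlation and the six pairs of values are hypothetical; the result is compared with NumPy's own corrcoef.

```python
import numpy as np

def auto_correlation(y1, y2):
    """Pearson correlation between the same measure taken at two times."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    d1 = y1 - y1.mean()
    d2 = y2 - y2.mean()
    # Numerator: sum of cross-products; denominator: root of the product
    # of the two sums of squares about the respective means.
    return (d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())

# Hypothetical values for six subjects at times t1 and t2.
y_t1 = [210, 265, 300, 180, 340, 255]
y_t2 = [230, 270, 330, 170, 360, 240]
r = auto_correlation(y_t1, y_t2)
print(round(r, 4))
```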

A key consideration then in designing and subsequently analysing a trial involving repeated measures is the nature and strength of this auto-correlation.

14.4.2 Patterns of auto-correlation

The problem for the investigators is that the properties of a repeated measures design depend on ρ. The value of this, and how it changes with the time-interval between observations, may be hard to pinpoint.

14.4.2.1 Independent

In the special case where successive observations on the same individual can be regarded as independent, there is clearly no auto-correlation present and ρ = 0 for whichever pair of different time-points we choose to compare. Thus, the correlation matrix of Figure 14.6a has a diagonal which contains unity in every position (since every observation is perfectly correlated with itself) but is 0 whenever an observation is compared with a previous or subsequent one of the same (repeated) measure. In this special case, auto-correlation need not be estimated and so the analytical process is at its simplest. For the simple linear regression model of Equation (2.1), this leads to estimates of the regression coefficients given by Equations (2.2) and (2.3) and for the residual standard deviation by (2.4). Since the entries beneath and above the diagonal of the correlation matrix of Figure 14.6a are symmetric (and this is true in all situations), only the lower half, including the diagonal, is usually presented.

14.4.2.2 Exchangeable

If the assumed auto-correlation between measurements made at any two arbitrarily chosen times, say time t1 and time t2, has the same value for ρ(yt1, yt2) whatever values of t1 and t2 we happen to choose, then we can write ρ(yt1, yt2) = ρ and the auto-correlation is termed exchangeable. Other terms used for this are ‘compound symmetry’ and ‘uniform’ correlation. As the magnitude and/or sign of ρ does not depend on the choice of the times we choose to compare, the correlation matrix is of the form of Figure 14.6b. This is clearly a simple pattern but nevertheless implies that, for example, to fit a simple linear model, we have the parameter, ρ, to estimate in addition to β0, βTreat and σ. Although there is a choice of names for this pattern, we use ‘exchangeable’ in what follows.

14.4.2.3 Autoregressive

In some situations, it may be supposed that as the time between observations increases the auto-correlation between observations at those times will decrease. Thus, we might assume that the auto-correlation takes the form ρ(yt1, yt2) = ρ^|t2 − t1|, where |t2 − t1| is the absolute value of the difference between t1 and t2, that is, whether the time difference is negative or positive we give it a positive value. Consequently, provided |ρ| < 1, ρ(yt1, yt2) will decline as the interval |t2 − t1| increases. The correlation structure for the special case when there are equal intervals between successive measurements is given in Figure 14.6c. If this pattern is assumed with (say) ρ = 0.5 then the corresponding auto-correlations between measurements taken on the first day (Column y1) and the successive days (Rows y1, y2, …, y6) would be 0.5^(1−1) = 0.5^0 = 1, ρ = 0.5^(2−1) = 0.5^1 = 0.5, ρ² = 0.5^(3−1) = 0.5^2 = 0.25, …, ρ⁵ = 0.5^(6−1) = 0.5^5 = 0.03125, and these are clearly decreasing as the time interval increases.

14.4.2.4 Unstructured

In this case, there is no consistent pattern over time in the auto-correlation matrix so that in the example of Figure 14.6d there are 15 distinct correlation coefficients to estimate. As one might imagine, this situation requires more complex algorithms to conduct the necessary calculations and, in practice, this difficulty often restricts the types of studies that can be analysed under the unstructured assumption.

(a) Independent
----+------------------------------
    |  y1   y2   y3   y4   y5   y6
----+------------------------------
y1  |   1    0    0    0    0    0
y2  |   0    1    0    0    0    0
y3  |   0    0    1    0    0    0
y4  |   0    0    0    1    0    0
y5  |   0    0    0    0    1    0
y6  |   0    0    0    0    0    1
----+------------------------------

(b) Exchangeable
----+------------------------------
    |  y1   y2   y3   y4   y5   y6
----+------------------------------
y1  |   1
y2  |   ρ    1
y3  |   ρ    ρ    1
y4  |   ρ    ρ    ρ    1
y5  |   ρ    ρ    ρ    ρ    1
y6  |   ρ    ρ    ρ    ρ    ρ    1
----+------------------------------

(c) Autoregressive
----+------------------------------
    |  y1   y2   y3   y4   y5   y6
----+------------------------------
y1  |   1
y2  |   ρ    1
y3  |  ρ²    ρ    1
y4  |  ρ³   ρ²    ρ    1
y5  |  ρ⁴   ρ³   ρ²    ρ    1
y6  |  ρ⁵   ρ⁴   ρ³   ρ²    ρ    1
----+------------------------------

(d) Unstructured
----+------------------------------
    |  y1   y2   y3   y4   y5   y6
----+------------------------------
y1  |   1
y2  | ρ21    1
y3  | ρ31  ρ32    1
y4  | ρ41  ρ42  ρ43    1
y5  | ρ51  ρ52  ρ53  ρ54    1
y6  | ρ61  ρ62  ρ63  ρ64  ρ65    1
----+------------------------------

Figure 14.6 Examples of possible auto-correlation matrix structures arising from 6 equally time-spaced observations in a repeated measures design. (a) Independent, (b) exchangeable, (c) autoregressive, and (d) unstructured
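The independent, exchangeable and autoregressive patterns described above are easily generated numerically. The Python sketch below is illustrative only; the function names are ours. It builds each matrix for six equally spaced observations and counts the distinct coefficients required by the unstructured pattern.

```python
import numpy as np

def independent_corr(w):
    """Identity matrix: no auto-correlation between different time-points."""
    return np.eye(w)

def exchangeable_corr(w, rho):
    """Same correlation rho between every pair of different time-points."""
    corr = np.full((w, w), float(rho))
    np.fill_diagonal(corr, 1.0)
    return corr

def autoregressive_corr(times, rho):
    """Correlation rho ** |t2 - t1| between observations at times t1 and t2."""
    t = np.asarray(times, dtype=float)
    return float(rho) ** np.abs(t[:, None] - t[None, :])

def n_unstructured_params(w):
    """Number of distinct off-diagonal correlations among w observations."""
    return w * (w - 1) // 2

ar = autoregressive_corr(range(1, 7), 0.5)
print(ar[:, 0])                      # first column: 1, 0.5, 0.25, ..., 0.03125
print(n_unstructured_params(6))      # 15
```

Passing unequally spaced times to autoregressive_corr, for example (0, 4, 12), shows how the assumed correlation falls away more quickly across the longer intervals.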

Example 14.7 6MWT in patients with heart failure Figure 14.7 shows the pair-wise scatter plots of the associations of 6MWT assessed on three occasions in patients with heart failure. The three correlations are of similar magnitude, all close to ρ = 0.8.

graph matrix Walk0 Walk4 Walk12, half

[Scatter-plot matrix (lower half): 6MWT distance at baseline, at 4 weeks and at 12 weeks plotted pairwise, with axis ticks from 0 to 600.]

pwcorr Walk0 Walk4 Walk12
---------+--------------------------
         |  Walk0    Walk4   Walk12
---------+--------------------------
Walk0    |      1
Walk4    | 0.8547        1
Walk12   | 0.7896   0.8427        1
---------+--------------------------

Figure 14.7 Pairwise correlations between 6MWT assessments at baseline, 4 and 12 weeks in patients with heart failure. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

14.5 Accounting for auto-correlation

As we have indicated, the models of Section 14.2 avoid the need to specify a particular type of auto-correlation whereas this is required of a truly longitudinal analysis. Thus, in fitting the corresponding models, it is important to link the successive longitudinal assessment measures to the individual concerned. The statistical package commands will therefore include a variable ‘id,’ the values of which represent the unique identifier of each subject within the trial concerned. As fixed, random, and both fixed and random effects are all likely to occur within the possible analysis strategies for repeated-measures trials, Stata provides the single command type (mixed) to be used for the three situations.

14.5.1 Linear profiles

In some circumstances, it may be anticipated that the intervention will induce a gradual (and linear) change in the end point measure over time so that the corresponding slopes following the two interventions may be compared. Frison and Pocock (1997) term this situation as having ‘linearly divergent treatment effects.’

Example 14.8 Heart failure – influence of time on 6MWT

In contrast to Example 14.4, which utilised the Walk12 post-randomisation levels as the outcome, Figure 14.8 describes the edited commands and output when the individual data values are utilised and it is presumed that post-intervention the change in Walk will be gradual over time, beginning with the Walk0 value. In this situation, the baseline value is regarded as the first of the post-intervention assessments. Ignoring the specific intervention allocated, and thus using Equation (14.12) with βTreat set to 0, Figure 14.8a, using the command (mixed Walk Time, || id:), suggests a gradual increase in Walk over Time of bTime = 4.649 m per week, which is highly statistically significant. As Figure 14.8b implies, the assumption of a fixed (common) slope applies to all patients irrespective of the treatment group. However, the id: following the || in the command indicates a random effect for the intercept, so that each individual patient's regression line may begin at a different intercept value.

(a) Random (varying) intercept but fixed (common) slope for Time

mixed Walk Time, || id:
Number of obs = 138, Number of groups = 49
(SE adjusted for 49 clusters in id)
-------------+----------------------------------------------------
             |           Robust
Walk         |    Coef       SE      z    P>|z|     (95% CI)
-------------+----------------------------------------------------
con   b*0    | 261.737   14.366
Time  bTime  |   4.649    0.786   5.92   0.0001   3.108 to 6.190
-------------+----------------------------------------------------

(b) Scatter plots for Placebo and FCM from selected subjects with associated regression lines of common slope

[Two panels, 0.9% Saline (Placebo) and Ferric Carboxymaltose (FCM): 6-minute walk (m), from 0 to 450, against time from randomisation (weeks; 0, 4, 12).]

(c) Random (varying) intercept and random (differing) slope for Time

mixed Walk Time, || id: Time, covariance(exchangeable)
-------------+----------------------------------------------------
Walk         |    Coef       SE      z    P>|z|     (95% CI)
-------------+----------------------------------------------------
con   b*0    | 263.370   13.760
Time  b*Time |   4.883    1.820   2.68   0.007    1.315 to 8.451
-------------+----------------------------------------------------

(d) Scatter plots for Placebo and FCM from selected subjects with associated regression lines with random intercept and random slope

[Two panels, 0.9% Saline (Placebo) and Ferric Carboxymaltose (FCM): 6-minute walk (m), from 0 to 450, against time from randomisation (weeks; 0, 4, 12).]

Figure 14.8 Mixed models for the 6-minute walk test (Walk) with random intercept with: (a and b) fixed slope for Time; (c and d) random slope for Time. Source: Data from Yeo, Yeo, Hadi, et al. (2018).


To also allow for a random slope, the command is extended to include the specified auto-correlation, here assumed exchangeable, to (mixed Walk Time, || id: Time, covariance(exchangeable)). The term covariance refers to the numerator of Equation (14.16), which links the two auto-correlated values y1 and y2. In the command, Time follows || id: to indicate that here it is also regarded as a random effect. As shown in Figure 14.8c, the gradual increase in Walk over Time is now estimated as the mean of the random slopes to give b∗Time = 4.883 m per week, while Figure 14.8d shows a diverse range of individual slopes.

Although omitted in Example 14.8, the calculations of the standard errors (SEs) of the regression coefficients from data in a repeated measures design are quite sensitive to the assumptions made when defining the model to fit. However, so-called robust methods of calculating the SEs are available. In Stata, the specific option is the variance-covariance estimate, vce(.), which takes note of the subject id within the associated cluster and thus identifies individual longitudinal profiles for each subject. Specifically, the option takes the form vce(cluster id).
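The idea behind a cluster-robust SE can be sketched in Python for the simpler case of ordinary least squares. This is an assumption-laden illustration, not a reproduction of what Stata's mixed command computes: it uses the basic ‘sandwich’ form with no finite-sample adjustment, and the function name ols_cluster_robust and the simulated data are hypothetical.

```python
import numpy as np

def ols_cluster_robust(X, y, cluster):
    """OLS estimates with basic cluster-robust (sandwich) standard errors.

    Residuals are summed within each cluster (subject id) before forming
    the middle of the sandwich, so correlation within a subject is allowed.
    No finite-sample correction is applied (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        rows = cluster == g
        score = X[rows].T @ resid[rows]      # one score vector per subject
        meat += np.outer(score, score)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Hypothetical longitudinal data: 40 subjects, assessed at weeks 0, 4 and 12.
rng = np.random.default_rng(0)
n_subj, n_obs = 40, 3
ids = np.repeat(np.arange(n_subj), n_obs)
time = np.tile([0.0, 4.0, 12.0], n_subj)
subject_effect = rng.normal(0, 60, n_subj)[ids]
walk = 260 + 4.5 * time + subject_effect + rng.normal(0, 30, n_subj * n_obs)
X = np.column_stack([np.ones_like(time), time])

beta, se_cluster = ols_cluster_robust(X, walk, ids)
print(np.round(beta, 3), np.round(se_cluster, 3))
```

If every observation is placed in its own cluster, the same function returns the usual heteroscedasticity-robust SEs, which ignore the link between successive observations on the same subject.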

Example 14.9 Heart failure – comparing Placebo with FCM

Although evidence of increasing Walk with Time is of importance, the main focus of the trial is the fixed-effect comparison of FCM and P. Thus, Treat is included only in the first, fixed part of the model, that is, it precedes ‘, ||’. The corresponding model shown in Figure 14.9a, which utilises the full Equation (14.13), assumes exchangeable auto-correlation and includes the vce(.) option. This indicates an advantage to FCM over P of 14.432 m but, with a p-value = 0.60, this is not statistically significant and the 95% CI is correspondingly very wide and covers zero. Although the results in Figure 14.9b, which assume an unstructured auto-correlation, differ somewhat from those obtained using the exchangeable assumption, the conclusion remains the same in that no demonstrable gain of FCM over P has been established.

(a) Exchangeable auto-correlation

mixed Walk Treat Time, || id: Time, covariance(exchangeable) vce(cluster id)
--------+------------------------------------------------------
        |           Robust
Walk    |    Coef       SE      z    P>|z|      (95% CI)
--------+------------------------------------------------------
cons    | 256.288   12.293
Treat   |  14.432   27.374   0.53    0.60   -39.221 to 68.085
Time    |   4.891    0.861
--------+------------------------------------------------------

(b) Unstructured auto-correlation

mixed Walk Treat Time, || id: Time, covariance(unstructured) vce(cluster id)
--------+------------------------------------------------------
        |           Robust
Walk    |    Coef       SE      z    P>|z|      (95% CI)
--------+------------------------------------------------------
cons    | 257.386   12.297
Treat   |   9.067   27.958   0.32    0.75   -45.729 to 63.863
Time    |   4.592    0.810
--------+------------------------------------------------------

Figure 14.9 Model for the 6-minute walk test (Walk) comparing with fixed Treat allowing for Time (a) assuming exchangeable and (b) unstructured auto-correlation. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

The use of Walk0 as part of the post-randomisation profile in Examples 14.8 and 14.9 may not be very realistic, as Figure 14.1 tends to suggest that in both the P and FCM groups, despite the rise in 6MWT by Week 4, there is little change thereafter. Thus, the assumption that 6MWT is rising linearly from the Walk0 values is questionable. In such cases, the baseline value (Walk0) might be considered as a covariate in any associated models for analysis. In this situation, Walk0 needs to be distinguished from the sequence denoted Walk (comprising Walk0, Walk4 and Walk12) because now it is no longer regarded as a follow-up variable. Hence Walk0 is renamed as BaseWalk in the output of Figure 14.10, which considers a random intercept with random slope model. In this analysis, bTreat = −1.607 is smaller in magnitude than, and opposite in sign to, the estimate of 14.432 of Figure 14.9, although both estimates have very wide confidence intervals with associated non-statistically significant conclusions.

mixed Walk Treat Time BaseWalk, || id: Time, covariance(exchangeable) vce(cluster id)
Number of obs = 89, Group variable: id, Number of groups = 46
Obs per group: min = 1, avg = 1.9, max = 2
----------------+------------------------------------------------------
                |           Robust
Walk            |    Coef       SE       z    P>|z|      (95% CI)
----------------+------------------------------------------------------
cons (b0*)      |  79.106   24.998
Treat (bTreat)  |  -1.607   15.216   -0.11    0.92   -31.430 to 28.216
Time (b*Time)   |   0.584    1.031
BaseWalk (c0)   |   0.905    0.085   10.63   0.0001
----------------+------------------------------------------------------

Figure 14.10 Model for comparing 6-minute walk test (Walk) values between (fixed) Treat allowing random intercept and random slope for Time with baseline covariate BaseWalk. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

14.5.2 Non-linear profiles

In other situations, perhaps the pattern first mimics a gradual and linear reduction and then, after a certain length of time, there is a plateau in each intervention group. In such a case, investigators may be interested in comparing the initial slopes as well as the ultimate levels attained by the two interventions.

14.6 The design effect (DE)

When it comes to estimating the number of subjects to recruit into a repeated-measures trial, the sample size formulae of Chapter 10 need to be modified, the extent of which depends on the particular design chosen. In essence, the sample size calculated, using the same considerations for the anticipated effect size, test size and power, is retained but is then multiplied by what is termed the Design Effect (DE). In what follows an exchangeable auto-correlation is assumed.

14.6.1 Change from baseline

This design is specified by v ≥ 1 pre-randomisation assessments, which are summarised by a single observation or their mean as appropriate, and compared with one or more post-randomisation assessments (w ≥ 1), also summarised by either a single observation or a mean. In these situations, model (14.6) is appropriate and the corresponding DE (described as CHANGE by Frison and Pocock, 1992, p. 1693) is

$$\mathrm{DE}_{\mathrm{CHANGE}} = -\frac{(v+1)\rho - 1}{v} + \frac{1 + (w-1)\rho}{w} = \left(\frac{1}{v} + \frac{1}{w}\right)(1 - \rho) \tag{14.17}$$

Although the first form of this expression emphasises the individual contributions of the pre- and post-randomisation longitudinal components, the second illustrates that DECHANGE decreases as ρ increases positively, and also as either, or both, v and w increase.

Example 14.10 Design effect – heart failure

If, at the analysis stage of the trial conducted by Yeo, Yeo, Hadi, et al. (2018), only the Week12 assessment is the focus then, with a single baseline Walk0, Equation (14.17) becomes, with v = w = 1, DECHANGE = 2(1 − ρ). On the other hand, if the mean of the two post-randomisation assessments is utilised then v = 1, w = 2 and DECHANGE = 3(1 − ρ)/2.
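The two forms of Equation (14.17) can be checked numerically. The Python sketch below uses function names of our own choosing and an illustrative ρ = 0.8, a value of the same order as the correlations reported in Figure 14.7, to evaluate both designs of Example 14.10.

```python
def de_change(v, w, rho):
    """Design effect for CHANGE: (1/v + 1/w)(1 - rho)."""
    if v < 1 or w < 1:
        raise ValueError("CHANGE requires v >= 1 and w >= 1")
    return (1.0 / v + 1.0 / w) * (1.0 - rho)

def de_change_components(v, w, rho):
    """First form of Equation (14.17): pre- plus post-randomisation parts."""
    pre_part = -((v + 1) * rho - 1) / v
    post_part = (1 + (w - 1) * rho) / w
    return pre_part + post_part

rho = 0.8                                   # illustrative value only
print(de_change(1, 1, rho))                 # 2(1 - rho), approximately 0.4
print(de_change(1, 2, rho))                 # 3(1 - rho)/2, approximately 0.3
```

With ρ this large, either design requires fewer than half the subjects of a single post-randomisation comparison, for which the DE is 1.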

14.6.2 Adjusting for baseline

This design is specified by v ≥ 0 pre-randomisation assessments, which are either absent (v = 0), a single observation (v = 1), or several observations (v ≥ 2) summarised by their mean, followed by one or more post-randomisation assessments (w ≥ 1) also summarised by either a single observation, a mean of several observations, or a particular contrast resulting in a single summary. In the situations when v > 0, model (14.8) is appropriate and the corresponding DE (described as ANCOVA by Frison and Pocock, 1992, p. 1693) is

$$\mathrm{DE}_{\mathrm{ANCOVA}} = -\frac{v\rho^2}{1 + (v-1)\rho} + \frac{1 + (w-1)\rho}{w} \tag{14.18}$$

The acronym ANCOVA refers to the analysis of covariance, which is essentially the same as linear regression in which adjustment is made for a baseline which is a continuous variable. For the case of no immediate pre-randomisation or earlier observations, v = 0 and the baseline regression coefficient is set to γ0 = 0 in Equation (14.8). In that case, Equation (14.18) (now described as POST) becomes

$$\mathrm{DE}_{\mathrm{POST}} = \frac{1 + (w-1)\rho}{w} \tag{14.19}$$

When w = 1 there is only one post-randomisation measurement, DEPOST = 1 and the design reduces to that of comparing two independent groups. When v = 0 and w > 1, DEANCOVA = DEPOST and these both increase with positively increasing ρ. This is because when v = 0, we are only considering the mean of the post-treatment measurements, and the variance of this mean increases as ρ increases. However, when v > 0, DEANCOVA does not keep increasing with ρ: it tends first to increase to a maximum and thereafter to decline. In general, as v increases, DEANCOVA decreases, as does its value as w increases. When a baseline measurement is made, hence v ≥ 1, the effect is to reduce DEANCOVA.

Example 14.11 Design effect – phenylketonuria – atopic eczema

The trial of patients with phenylketonuria by Levy, Milanowski, Chakrapani, et al. (2007) had repeat observations at Screen and at −2, −1, 0, 1, 2, 4 and 6 weeks for each individual, hence v = w = 4, in which case, from (14.18),

$$\mathrm{DE}_{\mathrm{ANCOVA}} = -\frac{4\rho^2}{1 + 3\rho} + \frac{1 + 3\rho}{4}.$$

The trial of Reynolds, Franklin, Gray, et al. (2001) in patients with adult atopic eczema had a single baseline, 0, and then five repeat observations at 3, 6, 9, 12 and 24 weeks. Thus v = 1, w = 5 and

$$\mathrm{DE}_{\mathrm{ANCOVA}} = -\rho^2 + \frac{1 + 4\rho}{5}.$$

The changes in the value of DEANCOVA for these two situations for increasing ρ are shown in Figure 14.11. In these two examples, the maximum values of DEANCOVA occur when ρ = 0.17 and 0.4 respectively.

[Plot: DE - ANCOVA (vertical axis from 0.20 to 0.35) against Rho (horizontal axis from 0 to 0.5, with 0.17 and 0.4 marked) for v = 1, w = 5 (Reynolds, Franklin, Gray, et al., 2001) and v = 4, w = 4 (Levy, Milanowski, Chakrapani, et al., 2007).]

Figure 14.11 Variation in the values of the design effect of Equation (14.18) for the designs used in the trials conducted. Sources: Levy, Milanowski, Chakrapani, et al. (2007); Reynolds, Franklin, Gray, et al. (2001).
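The behaviour of Equation (14.18) for the two designs of Example 14.11 can be explored with a short Python sketch. The function names and the simple grid search over ρ are our own choices for illustration.

```python
import numpy as np

def de_post(w, rho):
    """Design effect POST, Equation (14.19)."""
    return (1 + (w - 1) * rho) / w

def de_ancova(v, w, rho):
    """Design effect ANCOVA, Equation (14.18); reduces to POST when v = 0."""
    post_part = de_post(w, rho)
    if v == 0:
        return post_part
    return post_part - v * rho ** 2 / (1 + (v - 1) * rho)

def rho_at_maximum(v, w, n_grid=10001):
    """Locate the rho in [0, 1] at which DE_ANCOVA is largest (grid search)."""
    grid = np.linspace(0.0, 1.0, n_grid)
    values = de_ancova(v, w, grid)
    k = int(np.argmax(values))
    return grid[k], values[k]

for v, w in [(4, 4), (1, 5)]:
    rho_max, de_max = rho_at_maximum(v, w)
    print(f"v = {v}, w = {w}: maximum DE = {de_max:.4f} at rho = {rho_max:.3f}")
```

Under these formulae the grid search places the maxima close to ρ = 0.17 and ρ = 0.4, in agreement with the values quoted in the example.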


14.6.3 Post-intervention trends

If an investigator was interested in comparing two slopes, with the times at which each post-randomisation measurement is made specified as t1, t2, …, tw, then

$$D_t^2 = \sum_{j=1}^{w} (t_j - \bar{t})^2 = \sum_{j=1}^{w} t_j^2 - \frac{\left(\sum_{j=1}^{w} t_j\right)^2}{w} \tag{14.20}$$

is calculated. The corresponding DE assuming exchangeable auto-correlation is given by Diggle, Heagerty, Liang and Zeger (2002, p. 30) as

$$\mathrm{DE}_{\mathrm{Slope}} = \frac{1 - \rho}{D_t^2} \tag{14.21}$$

As we have indicated previously, in such a design, a decision is required at the planning stage as to whether or not any baseline observation taken is to be included in the regression slope calculation. Where the interventions may be ‘slow’ acting, the baseline may reasonably be anticipated to assist in the estimate of the resulting slope. If the intervention is ‘quick’ acting, perhaps resulting in an immediate change in the measurement levels, then it may be better to omit this baseline measure from the slope calculation but use it as a covariate.

Example 14.12 Design effect – atopic eczema

If we ignore the single baseline measure and consider the trial of Reynolds, Franklin, Gray, et al. (2001) in patients with adult atopic eczema, then we can consider the five repeat observations at 3, 6, 9, 12 and 24 weeks that are included to estimate the slopes of the corresponding regression lines within each treatment. In this case, with w = 5, Equation (14.20) gives D²t = 262.8 and therefore, from Equation (14.21), DESlope = (1 − ρ)/262.8.
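The arithmetic of this example is easily verified. In the Python sketch below the function names are ours, and both forms of Equation (14.20) are evaluated for the assessment times of 3, 6, 9, 12 and 24 weeks.

```python
def sum_squares_about_mean(times):
    """D_t^2 of Equation (14.20): sum of squared deviations of the times."""
    w = len(times)
    t_bar = sum(times) / w
    return sum((t - t_bar) ** 2 for t in times)

def sum_squares_shortcut(times):
    """Second form of Equation (14.20): sum t^2 - (sum t)^2 / w."""
    w = len(times)
    return sum(t ** 2 for t in times) - sum(times) ** 2 / w

def de_slope(times, rho):
    """Design effect for a comparison of slopes, Equation (14.21)."""
    return (1 - rho) / sum_squares_about_mean(times)

weeks = [3, 6, 9, 12, 24]
d2 = sum_squares_about_mean(weeks)
print(round(d2, 1))                       # 262.8
print(de_slope(weeks, 0.5))               # (1 - 0.5)/262.8, with rho = 0.5 illustrative
```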

14.6.4 Selected contrasts

In the general situation, where contrasts of the form of Equation (14.7) are considered, the corresponding variance of the contrast, C, involves the variance of each of the component weights Wk, and its covariance with every other weight, Wl. This is given by

$$\mathrm{Var}(C) = \sum_{l=0}^{c-1} \sum_{k=0}^{c-1} W_l W_k \theta_{lk}, \tag{14.22}$$

where c is the number of weights in the contrast concerned. In the case of exchangeable auto-correlation, the symmetry implies that θlk equals σ² when l = k and equals ρσ² when l ≠ k. This then leads to a DE of

$$\mathrm{DE}_{\mathrm{Contrast}} = \frac{\mathrm{Var}(C)}{\sigma^2} \tag{14.23}$$

In the case when there are three weights, Equation (14.22) becomes

$$\mathrm{Var}(C) = \left(W_0^2 + W_1^2 + W_2^2\right)\sigma^2 + 2\left(W_0 W_1 + W_1 W_2 + W_2 W_0\right)\rho\sigma^2.$$

Example 14.13

DE for a specific contrast

Figure 14.12 gives an example of Equation (14.23) where the weights chosen for a c = 3 component contrast are W0 = −1, W1 = W2 = 0.5. Such a contrast essentially implies calculating for each patient the difference between the mean of the post-randomisation assessments, denoted y1 and y2, and the baseline assessment, denoted y0. In this case, the corresponding DEContrast = 1.5(1 − ρ). This will be less than 1 when ρ > 1/3.

| Weights  | W0 = −1            | W1 = 0.5           | W2 = 0.5           |
| W0 = −1  | W0²σ² = σ²         | W0W1ρσ² = −0.5ρσ²  | W0W2ρσ² = −0.5ρσ²  |
| W1 = 0.5 | W1W0ρσ² = −0.5ρσ²  | W1²σ² = 0.25σ²     | W1W2ρσ² = 0.25ρσ²  |
| W2 = 0.5 | W2W0ρσ² = −0.5ρσ²  | W2W1ρσ² = 0.25ρσ²  | W2²σ² = 0.25σ²     |

Var(C) = (1 + 0.25 + 0.25)σ² + (−0.5 − 0.5 − 0.5 + 0.25 − 0.5 + 0.25)ρσ² = 1.5(1 − ρ)σ².

Figure 14.12 Structure of the anticipated variance, Var(C) = [SD(C)]2, for a contrast, C, with 3 Weights: −1, 0.5 and 0.5 and assuming exchangeable auto-correlation
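The double sum of Equation (14.22) is easy to evaluate directly. The sketch below assumes exchangeable auto-correlation, so that θlk/σ² is 1 on the diagonal and ρ elsewhere, and checks the result against the closed form 1.5(1 − ρ) of Figure 14.12. The function name `de_contrast` is hypothetical.

```python
def de_contrast(weights, rho):
    """Var(C) / sigma^2 for a contrast under exchangeable auto-correlation."""
    c = len(weights)
    total = 0.0
    for l in range(c):
        for k in range(c):
            theta = 1.0 if l == k else rho  # theta_lk divided by sigma^2
            total += weights[l] * weights[k] * theta
    return total


weights = [-1, 0.5, 0.5]  # W0, W1, W2 of Example 14.13
for rho in (0.0, 0.25, 0.6):
    # The two printed values should agree for every rho.
    print(rho, de_contrast(weights, rho), 1.5 * (1 - rho))
```

The same function applies to any other set of weights, for example equal weights 1/w on the post-randomisation assessments only.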

When there are w post-randomisation observations made with all weights $W_k = 1/w$ then Equation (14.23) becomes the DEPOST of Equation (14.19). When the weights are defined by

$$W_k = \frac{t_k - \bar{t}}{\sum_k (t_k - \bar{t})^2},$$

then this is essentially the situation when the slope is to be estimated. This therefore leads to DESlope of Equation (14.21).

14.6.5 Comparing design effects

As we have indicated, the design team are required to choose the particular structure for their repeated-measures trial and so have to choose one of the options as outlined by Frison and Pocock (1992, 1997). We use the trial of Bowman, Everett, O’Dwyer, et al. (2017) to illustrate the different design options in Figure 14.13.

Design CHANGE: If the baseline assessment VAS0 is included (v = 1) and the endpoint is defined by the paired difference of VAS0 and the mean of the w = 4 post-randomisation assessments then, from Equation (14.17), DECHANGE = 5(1 − ρ)/4. This will decrease as ρ increases.

Design ANCOVA: If VAS0 is used as a covariate then v = 1, and with w = 4, applying Equation (14.18) gives DEANCOVA = (1 + 3ρ − 4ρ²)/4. The maximum value of this occurs when ρ = 0.375, in which case $DE_{ANCOVA}^{Max} = 0.3906$.

Design POST: If the baseline assessment VAS0 is ignored (v = 0) but the endpoint is defined by the mean of the w = 4 post-randomisation assessments then, from Equation (14.19), DEPOST = (1 + 3ρ)/4, which will increase as ρ increases.

Figure 14.13 Design effects for repeated measure designs
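The three design effects of Figure 14.13 can be compared numerically. Equations (14.17) to (14.19) are not reproduced in this section, so the general forms below are an assumption: they are the usual exchangeable-correlation expressions for v baseline and w post-randomisation assessments, and they reduce to the v = 1, w = 4 special cases quoted in the figure. The function names are illustrative.

```python
def de_post(w, rho):
    """POST: mean of w post-randomisation assessments, baseline ignored."""
    return (1 + (w - 1) * rho) / w


def de_change(v, w, rho):
    """CHANGE: post-randomisation mean minus the mean of v baselines (v >= 1)."""
    return de_post(w, rho) + (1 + (v - 1) * rho) / v - 2 * rho


def de_ancova(v, w, rho):
    """ANCOVA: post-randomisation mean adjusted for v baselines (v >= 1)."""
    return de_post(w, rho) - v * rho ** 2 / (1 + (v - 1) * rho)


rho = 0.375
print(de_change(1, 4, rho))  # 0.78125
print(de_ancova(1, 4, rho))  # 0.390625
print(de_post(4, rho))       # 0.53125

# Grid search for the value of rho giving the largest ANCOVA design effect.
grid = [i / 1000 for i in range(1001)]
rho_max = max(grid, key=lambda r: de_ancova(1, 4, r))
print(rho_max)               # 0.375
```

A grid search is used rather than calculus so that the same few lines work unchanged for other choices of v and w.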

14.7

Trial size

The sample size calculations described here make the assumption of a fixed effect for the change with Time. However, Fitzmaurice, Laird and Ware (2011, p. 588) indicate that the assumption of a fixed, rather than a random effects slope, is likely to give an underestimate of the sample size, and so these estimates should be regarded as a minimum requirement.

14.7.1 Continuous outcomes

For a continuous outcome measure with an anticipated standardised effect size, ΔPlan = δPlan/σPlan, a 1: φ allocation, 2-sided test size α, and power 1 − β, the formula for estimating sample size is an extension of Equation (9.5) with the DE as a multiplier to give

$$N = DE \times \left[ \frac{(1+\varphi)^2}{\varphi}\,\frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta_{Plan}^2} + \frac{z_{1-\alpha/2}^2}{2} \right], \qquad (14.24)$$

where the DE chosen depends on the design of the proposed longitudinal trial.

Example 14.14

Primary Sjögren’s syndrome

Bowman, Everett, O’Dwyer, et al. (2017) conducted a clinical trial in which 133 patients with primary Sjögren’s syndrome were randomised equally to Placebo (P) and Rituximab (R). Their repeated measures design with observations made at 0, 16, 24, 36 and 48 weeks showed little evidence of either change over time or between interventions in, for example, fatigue which was recorded using a visual analogue scale (VAS).

If no baseline measures are taken (v = 0), but with only a single outcome assessment (w = 1) then DE = 1. Assuming a 2-sided α = 0.05, 1 − β = 0.8 and, if we specifically choose, ΔPlan = δPlan/σPlan = 0.565 then Equation (14.24) with φ = 1 gives a total sample size N = 100.27. This corresponds to using Equation (9.5) when the model for estimating the treatment effect is essentially the independent t-test. This is the planning sample size obtained by the investigators.

Had they designed their (v = 1, w = 4) trial with an ANCOVA design then, as we have shown in Figure 14.13, the maximum value of DEANCOVA = 0.3906 is when ρ = 0.375. Hence the required sample size is reduced to NANCOVA = 100.27 × 0.3906 = 39.17 or 40 patients with 20 assigned to each group. However, this implies v + w = 5 assessments per individual so that the total number of assessments to be made is AANCOVA = 5 × NANCOVA = 200. The profile of sample size required with changing ρ for this design is illustrated in the lower curve of Figure 14.14.

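[Figure 14.14: curves of Number of Patients, N (vertical axis, 0 to 100) against Rho (horizontal axis, 0.1 to 0.9) for the designs (v = 0, w = 1), (v = 1, w = 1), (v = 1, w = 2), (v = 1, w = 3) and (v = 1, w = 4), with ρ = 0.375 and N = 40 marked.]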

Figure 14.14 Variation in sample size for a continuous outcome variable as a function of the chosen ANCOVA repeated measures design (v, w) for a given exchangeable auto-correlation, ρ. This uses Equation (14.24) with 2-sided α = 0.05, 1 − β = 0.8 and φ = 1
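The sample sizes of Example 14.14 and the ANCOVA profile of Figure 14.14 can be reproduced with a brief sketch of Equation (14.24). The function names are hypothetical, the ANCOVA design effect for v = 1 is taken to be [1 + (w − 1)ρ]/w − ρ² (an assumed form that matches the w = 4 case of Figure 14.13), and totals are rounded up to the next even number to mirror the equal allocation used in the example.

```python
from math import ceil
from statistics import NormalDist


def n_continuous(delta_plan, de=1.0, phi=1.0, alpha=0.05, power=0.8):
    """Total sample size of Equation (14.24) for standardised effect delta_plan."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_single = ((1 + phi) ** 2 / phi) * (z_alpha + z_beta) ** 2 / delta_plan ** 2
    n_single += z_alpha ** 2 / 2
    return de * n_single


def round_up_even(n):
    """Round up to the next even integer so that a 1:1 allocation is balanced."""
    return 2 * ceil(n / 2)


n0 = n_continuous(0.565)
print(round(n0, 2))  # 100.27

rho = 0.375
for w in (1, 2, 3, 4):
    de = (1 + (w - 1) * rho) / w - rho ** 2  # assumed ANCOVA form with v = 1
    n = n_continuous(0.565, de=de)
    print(w, round(de, 4), round(n, 2), round_up_even(n))
```

Replacing the fixed ρ by a loop over a range of values gives the full set of curves rather than a single point on each.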


If the design were CHANGE, then retaining ρ = 0.375, DECHANGE = 0.7813 and consequently NCHANGE = 78.34 or 80 patients with ACHANGE = 5 × 80 = 400. Similarly, using POST, DEPOST = 0.5313, NPOST = 53.27 or 54 with APOST = w × NPOST = 4 × 54 = 216. Figure 14.14 also shows the changing profile of NANCOVA as the number of post-randomisation assessments w changes from 1 to 4. In the absence of knowledge of the actual value of the auto-correlation, investigators may choose such a conservative approach knowing that this ‘maximum’ sample size calculated assuming ρ = ρMax will be sufficient for the design whatever the true value of ρ happens to be.

14.7.2 Binary outcomes

If the outcome variable of the 2-group design is binary, such as when a satisfactory response to treatment either is or is not observed, then the total number of subjects required with w repeat post-randomisation measures, for anticipated difference δ = πT − πS, is obtained by modifying Equation (9.6) using a specified DE as implied by Diggle, Heagerty, Liang and Zeger (2002, p. 31) to give

$$N = DE \times \frac{1+\varphi}{\varphi}\,\frac{\left\{ z_{1-\alpha/2}\sqrt{(1+\varphi)\,\bar{\pi}(1-\bar{\pi})} + z_{1-\beta}\sqrt{\varphi\,\pi_S(1-\pi_S) + \pi_T(1-\pi_T)} \right\}^2}{(\pi_T - \pi_S)^2}. \qquad (14.25)$$

Here $\pi_S$ and $\pi_T$ are the proportions anticipated to respond in the respective S and T groups and $\bar{\pi} = (\pi_S + \varphi\pi_T)/(1+\varphi)$. The numbers to be recruited to the two groups are nS = N/(1+φ) and nT = Nφ/(1+φ).

Example 14.15

Sample size – primary Sjögren’s syndrome

Although the trial of Bowman, Everett, O’Dwyer, et al. (2017) utilised continuous measures they state: ‘The primary endpoint was the achievement of a reduction of at least 30% relative to the baseline measurement in the patient completed VAS assessments … at week 48.’ Thus, although a repeated measures design, the trial sample size was determined on the basis of the change from baseline to the final assessment. The authors state: ‘The predefined minimum clinically important effect of rituximab was an increase in treatment response rate from 20% in the placebo arm to 50% in the rituximab arm.’ They specified a 2-sided significance level of 5% and a power of 80%. A preliminary calculation setting DE = 1 in Equation (14.25) gives N0 = 76.96. If we assume this level of improvement is apparent by 16 weeks post-randomisation and is then maintained at about this level over the subsequent 24, 36 and 48 weeks, then this is a CHANGE design. Thus, assuming an exchangeable auto-correlation structure, Equation (14.17) with v = 1 and w = 4 leads to DECHANGE = 5(1 − ρ)/4 and if ρ = 0.375, as in Example 14.14, then DECHANGE = 0.7813. When combined with N0 the required sample size is NCHANGE = 76.96 × 0.7813 = 60.13 or 62 patients to allow for equal numbers in each treatment arm.
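The calculation of Example 14.15 can be sketched as follows. The function `n_binary` is a hypothetical name, and it assumes that the two bracketed terms in the numerator of Equation (14.25) are under square roots, which is the reading that reproduces the N0 = 76.96 quoted in the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_binary(pi_s, pi_t, de=1.0, phi=1.0, alpha=0.05, power=0.8):
    """Total sample size for a binary outcome, with the DE as a multiplier."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pi_bar = (pi_s + phi * pi_t) / (1 + phi)
    numerator = (
        z_alpha * sqrt((1 + phi) * pi_bar * (1 - pi_bar))
        + z_beta * sqrt(phi * pi_s * (1 - pi_s) + pi_t * (1 - pi_t))
    ) ** 2
    return de * (1 + phi) / phi * numerator / (pi_t - pi_s) ** 2


n0 = n_binary(0.20, 0.50)                     # DE = 1
de_change = 5 * (1 - 0.375) / 4               # v = 1, w = 4, rho = 0.375
n_change = n_binary(0.20, 0.50, de=de_change)

print(round(n0, 2))        # 76.96
print(round(de_change, 4)) # 0.7812 (0.78125 before rounding)
print(2 * ceil(n_change / 2))  # 62, the next even total
```

With an unequal allocation (φ ≠ 1) the group sizes follow as N/(1 + φ) and Nφ/(1 + φ), as stated after Equation (14.25).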

14.8

Practicalities

14.8.1 Number of repeat assessments

By use of a repeated measures design, the number of patients required for a clinical trial can be reduced at the expense of increasing the numbers of observations made on each patient. This strategy implies that the total number of observations required is greater than would have been the case if only a single assessment on each individual recruited had been made. This increase clearly has resource implications in terms of the number of examinations that have to be made. Further, it increases the complexity of the follow-up scheduling and thereby increases the possibility of patient non-compliance in this respect.

For example, when considering the special case v = w = 1 in which there is only a single post-intervention assessment, the DEs of Equations (14.17) and (14.18) reduce to DECHANGE = 2(1 − ρ), which is ≤ 1 if ρ ≥ 0.5, and DEANCOVA = 1 − ρ² ≤ 1 respectively. Thus, the addition of a baseline to a single post-randomisation assessment design can reduce the sample size required. In general, by taking repeated measurements at regular time points, the required total sample size is considerably reduced.

Nevertheless, there is a practical limit to increasing the number of post-randomisation assessments. Thus, with v = 1, but with w increasing to infinity, DECHANGE tends to (1 − ρ) and DEANCOVA approaches ρ(1 − ρ). The maximum value of DECHANGE = 1 occurs when ρ = 0 in which case the repeat observations from an individual are independent of each other which is somewhat unlikely. On the other hand, the maximum value of DEANCOVA = 0.25 when ρ = 0.5. Thus, the best we can do in the latter case by increasing w is to reduce the sample size to a quarter that of a design with the outcome measure assessed once with no (pre-randomisation) covariates involved.
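The trade-off between fewer patients and more assessments can be made concrete with a small sketch. It takes N0 = 100.27 from Example 14.14 as the single-assessment total, assumes the v = 1 ANCOVA design effect [1 + (w − 1)ρ]/w − ρ², and rounds totals up to an even number. All names are illustrative.

```python
from math import ceil

N0 = 100.27   # single-assessment total from Example 14.14 (v = 0, w = 1)
RHO = 0.375   # planning value of the auto-correlation
V = 1         # one baseline assessment


def de_ancova_v1(w, rho):
    """Assumed ANCOVA design effect with a single baseline (v = 1)."""
    return (1 + (w - 1) * rho) / w - rho ** 2


def plan(w, rho=RHO):
    """Return (patients, total assessments) for w post-randomisation visits."""
    n = 2 * ceil(N0 * de_ancova_v1(w, rho) / 2)  # round up to an even total
    return n, (V + w) * n


for w in (1, 2, 4, 8, 16):
    print(w, plan(w))

# As w grows the design effect approaches rho * (1 - rho), so the gain flattens.
print(de_ancova_v1(10_000, RHO), RHO * (1 - RHO))
```

Printing both the number of patients and the number of assessments side by side shows where further visits stop paying for themselves under the chosen ρ.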
14.8.2 Making the assessments

Although precisely when the assessments are to be made is specified in the protocol for a repeated measures design, such schedules often have to fit into, for example, the vagaries of hospital practice and patient compliance. Thus, for example, the trial of Yeo, Yeo, Hadi, et al. (2018) specified a 12-week assessment but of the 49 heart failure patients recruited only 37 completed their VAS at that time. It is also likely that some of these would not be completed ‘exactly’ at 12 weeks but only within a window around that time. In certain cases, a bandwidth around each assessment date may be set by the investigators and only observations made within that are considered as usable data for the time-point concerned.

14.8.3 Choice of auto-correlation structure

We have indicated in Figure 14.6 some possible auto-correlation structures. In repeated measures designs, the independent structure of Figure 14.6a is unlikely. Technical reasons restrict the use of the autoregressive structure of Figure 14.6c. These reasons include the necessity for equal spacing between successive observations and complete profiles for every patient. The unstructured format of Figure 14.6d appears the most tempting to use but, as all the w(w − 1)/2 correlation parameters have to be estimated, this results in inflated standard errors for the parameters in the model itself. Also, if the trial is small, there may be so-called convergence problems which prevent the computer algorithm being able to complete the calculations. Consequently, the exchangeable auto-correlation structure of Figure 14.6b is generally specified.

14.8.4 Strength of auto-correlation

In some situations, previous experience may suggest a planning value for the magnitude of the auto-correlation but more often such information will not be available. Even from published trials, the value of ρ chosen for the design may not be indicated, and without recourse to the original trial data we cannot estimate a planning value for ρ. For the case of exchangeable auto-correlation, Montoya, Higuita, Estrada, et al. (2012) in HIV-infected, HAART naïve patients assume a ‘… fixed correlation of 0.8 between three different measurements at (0, 6 and 12 months), …’ while Yeo, Yeo, Hadi, et al. (2018) in their trial in patients with heart failure state that ‘… an autocorrelation coefficient of 0.6 were assumed … with one pre-intervention and two post-intervention measurements.’ No specific justification for these values was given although they are close to the suggestion of Frison and Pocock (1992) that values of ρ between 0.60 and 0.75 are common in repeated measures designs. In many cases, the design team might explore how the DE will change depending on the value set for ρPlan. A pragmatic approach is then to adopt a ‘worst-case scenario’ and take the maximum DE as the multiplier when estimating the final trial size.

14.8.5 Withdrawals and missing values

As one might expect, as the number of assessments of individual patients included in any repeated measures design increases, there is an increasing likelihood that the number of ‘missed’ assessments will escalate with time. Thus, it is important to ensure at the design stage of such a trial, that due consideration is taken of the assessment burden on the patient as well as on the investigating team. Such considerations may restrict repeated measures designs to situations where patients are likely to be compliant with the follow-up schedule required and also to the choice of assessments which are in themselves not too burdensome.

There are numerous possibilities for the different patterns of missing data ranging from, for example, following a baseline assessment, all but one of the possible w (>1) assessments stipulated by the design being made (termed missing) to none of the potential post-randomisation observations being made (termed withdrawals). To take into account the possibility of withdrawals, the use of Equation (9.15) is suggested. However, there is no straightforward way of dealing with ‘missing’ values although many suggestions are made by Fayers and Machin (2016) who devote a whole chapter to this topic in their book concerned with longitudinal Quality of Life studies.

14.8.6 Protocol description of intended analysis

In Chapter 3, we indicated that the full details of the analysis to be made (and hence presented in the final report) must be stipulated in the protocol. However, it is often very difficult to be entirely sure of how best to approach the analysis of a repeated-measures trial so our suggestion is that the difficulties should be briefly outlined in the protocol with a sketch of the strategy that one is likely to adopt at the analysis stage. In addition, the possibility that alternative strategies may have to be adopted should be indicated. Of course, when reporting the actual trial results, one should include some rationale for the choice of the final analytic approach chosen.

Clearly, the strategy of first ‘looking at the data’ as in Example 14.6 then instituting an analysis which seems appropriate is not consistent with specifying details of the subsequent analysis (often a regulatory requirement) in the trial protocol at a time when no patient has yet been recruited to the trial. One option in such a situation is to (perhaps) specify one of the post-randomisation assessments as the primary end point and describe the approach to the analysis of that in the protocol. Added to this, provide a brief description of the more exploratory analyses together with an indication that when appropriate a fuller description of these will be reported in the eventual trial publication.

Further, the choice of the appropriate auto-correlation structure may be quite challenging without reference to previous work. Indeed, this choice will need to be considered at the design stage when an appropriate sample size for the trial is being considered. At the analysis stage, that structure considered at the design stage needs to be used to fit the corresponding models. Nevertheless, a pragmatic approach is also to utilise some of the other options available within the chosen statistical package and compare the results. In many instances, the shape of the longitudinal profiles may differ considerably from what was anticipated. In such instances, when the final interpretation of the key feature (that is the estimate of βTreat and its associated CI) differs little between the options then a conservative approach may be to report that with the ‘widest’ CI.


Of course, if the new trial in question is similar in nature to a previous trial then this prior information may enable more focused sample size and analysis sections of the protocol to be drafted. Further, in order to fit the regression model appropriate to the trial design chosen, specialist statistical packages are required and exactly which are used should be specified within the protocol. However, dependent on the trial’s likely duration, such packages may become upgraded and more analytical options may become available. An indication of this possibility should be noted in the protocol and any consequences noted in the trial report.

14.9

Reporting

The satisfactory reporting of the results from repeated measures designs is very challenging. Dependent on the design chosen, the trial report should include a table in the format of, for example Figure 14.15, together with a verbal description of its contents. In such a tabulation, when there is a baseline assessment, its values are not compared between the intervention options, whereas post-randomisation comparisons are clearly relevant. If the repeated measures are taken at pre-defined times then the number of patients presenting at each time-point within each intervention needs to be specified. As may be anticipated in many repeated measures designs, Figure 14.15 shows some attrition with time in that although 49 patients had a baseline assessment, their numbers had reduced at each assessment to a level of 44 by Week-12.

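| Group      | Statistic | Baseline (Week 0) | Week 1       | Week 4       | Week 12      | Mean post baseline |
| P and FCM  | N         | 25 + 24 = 49      |              |              |              |                    |
|            | Mean      | 50.88             |              |              |              |                    |
|            | Range     | 19.27–79.85       |              |              |              |                    |
| P          | n         |                   | 23           | 23           | 23           | 23                 |
|            | Mean      |                   | 76.3         | 83.9         | 87.3         | 82.5               |
|            | Range     |                   | 47.9–95.8    | 78.0–89.8    | 84.6–89.9    | 78.8–86.1          |
| FCM        | n         |                   | 24           | 23           | 21           | 21                 |
|            | Mean      |                   | 75.4         | 78.6         | 81.2         | 79.2               |
|            | Range     |                   | 32.0–97.9    | 69.9–87.3    | 74.2–88.2    | 71.9–86.6          |
| Difference | N         |                   | 47           | 46           | 44           | 44                 |
|            | P − FCM   |                   | 0.9          | 5.3          | 6.1          | 3.3                |
|            | 95% CI    |                   | −0.9 to 10.7 | −4.9 to 15.6 | −1.0 to 13.1 | −4.5 to 11.0       |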

Figure 14.15 Possible tabular presentation of the results from the Kansas City Cardiomyopathy Questionnaire (KCCQ) by P and FCM treatment groups from a repeated measure study with a continuous endpoint variable. Source: Data from Yeo, Yeo, Hadi, et al. (2018).


The choice of summary measures to present will depend on the trial concerned but mean values within each intervention would be a common choice. However, the way in which the variability surrounding those means is described is more problematical. Thus, the main purpose of the table is to give the reader some feel for the ‘shape’ of the data and the associated ‘variability.’ The formal analysis, which will often use a modelling approach, should be presented elsewhere in the trial report. We recommend providing the range (minimum and maximum values) for each intervention at each time-point. In this example, there is some evidence of a greater variability in KCCQ among patients receiving FCM. However, confidence intervals (CIs) of each mean (here 3 in each treatment group) are not appropriate and therefore should not be presented.

A trial by its very nature is a comparative study, so the interest should focus on the difference between the intervention means and not their individual properties. Thus, the lower panel of Figure 14.15 includes the difference between the two intervention means at each time-point. From these four values, there is a suggestion that, as time passes, the difference between them with respect to KCCQ values increases. To give some idea of the reliability of these individual differences, we have suggested presenting the 95% CI of each. However, these CIs should not be taken as evidence of a (statistical) difference between the interventions as this is the role of the modelling process. The final column of Figure 14.15 is only appropriate in this case, if the underlying model used for the analysis is the POST option of Figure 14.13.

In most instances, it is also important to include some graphical support in order to give more meaning to the summary details provided in the tabular format. In particular, it is important to demonstrate some justification for the model(s) chosen. Thus, for example, Figure 14.1 gives a strong indication of the underlying model with neat parallel post-randomisation profiles suggesting a constant difference between groups over time. However, it provides no indication of the shapes of individual patient profiles. Plotting individual data points in slim clouds at each assessment point will not always provide a satisfactory presentation unless the data points are few and/or the interventions are sufficiently distinct to allow minimal overlap between the intervention groups. However, by colour coding data points of the P and FCM interventions of the trial of Yeo, Yeo, Hadi, et al. (2018) in Figure 14.16, the two groups can be distinguished. This figure utilises a ‘jittering’ device which, for example, plots the data of Week-1 scattered around coincident and near coincident points with the same and/or similar KCCQ values to avoid overlay. At each assessment point, variation in KCCQ is considerable in both groups, and values are clearly overlapping. Despite this, there is a clear increase over baseline at Week-1, a further increase to Week-4 and that level is approximately retained at Week-12. Adding (connected) summary profiles of each individual patient to this plot is unlikely to be useful.

In Figure 14.16, there is a strong suggestion that the distributions of KCCQ at each assessment point are rather skewed. As a consequence, the geometric mean (GM) has been used to define the summary profiles. This may impact on the tabular presentation of Figure 14.15 by replacing the mean by the GM at each time-point for the two interventions. Means, provided their use does not unduly distort the interpretation, are easier to interpret than GMs particularly when expressing the difference between the intervention groups as in the lower panel of Figure 14.15.
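The two devices mentioned above, jittering and the geometric mean, are simple to compute. The sketch below uses made-up KCCQ-like scores rather than any data from the trial, and the `jitter` helper with its `width` and `seed` parameters is a hypothetical illustration, not the method used to draw Figure 14.16.

```python
import random
from statistics import geometric_mean, mean

# Hypothetical scores at a single visit; these are not data from the trial.
scores = {
    "P": [40.0, 90.0, 85.0, 95.0],
    "FCM": [35.0, 80.0, 88.0, 92.0],
}


def jitter(week, n, width=0.3, seed=1):
    """Return n horizontal plotting positions scattered uniformly around week."""
    rng = random.Random(seed)
    return [week + rng.uniform(-width, width) for _ in range(n)]


for arm, values in scores.items():
    xs = jitter(week=1, n=len(values))
    # Arithmetic mean and geometric mean side by side for comparison.
    print(arm, round(mean(values), 1), round(geometric_mean(values), 1))
    print([round(x, 2) for x in xs])
```

Only the horizontal position is perturbed, so the plotted score of each patient is unchanged.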

[Figure 14.16: jittered scatter plot of KCCQ (vertical axis, 0 to 100) against Time from randomisation (week) at 0, 1, 4, 8 and 12, with separate symbols for 0.9% Saline (Placebo) and Ferric Carboxymaltose (FCM).]

Figure 14.16 Jittered scatter plot of baseline and successive post-randomisation Kansas City Cardiomyopathy Questionnaire (KCCQ) by treatment group with the geometric means at each visit connected. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

Although useful, summary profiles may give no real clue as to the appearance of the individual profiles of each patient and the extent of variation between them. Only in exceptional cases could a single graphic include all the individual profiles as these are likely to overlap considerably. Hence tracing by eye individual profiles would be impossible. Nevertheless, it is often useful to demonstrate in some way the considerable variation which is usually present in these profiles by selecting (in some way – possibly at random) a sample of profiles and illustrating them graphically as in Figure 14.17. Once again, it may still be quite difficult to identify individual profiles so that the format of Figure 14.18 may be preferred. The statistical methods section in the report of the trial will have described the specific detail of the underlying form of the model chosen for the analysis. Such a model may, for example, include a baseline assessment which may turn out to have no influence on the interpretation with respect to the difference between the interventions being compared. In this case, this fact should be verbally indicated but the final statistical model presented would not include the baseline term. In all instances, the model presented must include the intervention effect, whether statistically significant or not, as this is the main concern of the trial. Often the time element is also included, again whether or not statistically significant. An illustration of what may be included is given

[Figure 14.17: two panels, 0.9% Saline (Placebo) and Ferric Carboxymaltose (FCM), each showing KCCQ (vertical axis, 0 to 100) against Time from randomisation (week) at 0, 1, 4, 8 and 12.]

Figure 14.17 Individual profiles of the Kansas City Cardiomyopathy Questionnaire (KCCQ) score of patients with heart failure: Five selected from each of the P and FCM groups. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

[Figure 14.18: small panels, one per selected patient (panel labels include 13, 15, 17 and 19), grouped under Placebo and FCM, each showing KCCQ (vertical axis, 0 to 100) against Time from randomisation (week). Graphs by id.]

Figure 14.18 Profiles to illustrate variation in KCCQ of eight selected patients with heart failure. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

| Term             | Coefficient | Standard Error (SE) | z     | p-value | 95% CI          |
| Intercept        | 79.11       | 25.00               |       |         |                 |
| Treatment Effect | −1.61       | 15.22               | −0.11 | 0.92    | −31.43 to 28.22 |
| Time             | 0.58        | 1.03                |       |         |                 |
| Baseline (6MWT)  | 0.91        | 0.09                | 10.63 | 0.0001  |                 |

Figure 14.19 Summary model of change in 6-minute walk test (6MWT) with P and FCM in patients with heart failure. Source: Data from Yeo, Yeo, Hadi, et al. (2018).

in Figure 14.19, essentially a repeat of Figure 14.10, but this should not be taken as a true indication of the results of the trial conducted by Yeo, Yeo, Hadi, et al. (2018). In this table, unnecessary details are omitted, for example the statistical package used for analysis will include z, p-value and the 95% CI for Intercept and Time which, in this instance, are unlikely to impact on the interpretation of the trial results. The main feature is the Treatment Effect which is very small, has a large SE, hence a large (non-significant) p-value and a wide CI. An important addition to this table, is the very statistically significant Baseline suggesting that the pre-randomisation 6MWT is very predictive of post-randomisation values. However, this does not necessarily imply that its presence in the model materially affects the reported value of the Treatment Effect. A comment on whether it does or does not should be included in the trial report. Consideration needs to be given to the number of decimal places reported. We have chosen two decimal places although in considering the small Treatment Effect of −1.61 and its associated CI, this could have been reduced to −1.6. However, carrying this through to the other items in the table, the Baseline coefficient becomes 0.9 with SE 0.1, which would give z = 0.9/0.1 = 9.0 in place of 10.63. This rounding error distortion suggests we should retain the two decimal places in the presentation. In a similar way, the chosen number of decimal places used in Figure 14.15 focusses on the smallest difference between the two groups which is 0.9 at Week-1 although the upper panels of the table suggest that all might be rounded to integer values. The above possibilities are still feasible if more than two interventions are to be compared although if complex summary profiles are obtained, such as those given in Figure 14.2, deciding how best to convey graphically the results in a succinct manner may not be at all straightforward.
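The effect of rounding on the derived statistics can be seen by recomputing z and a Wald-type 95% CI from the tabulated coefficients. This sketch assumes z = coefficient/SE and a CI of coefficient ± 1.96 × SE. Because the inputs are themselves rounded to two decimal places, the results need not match those a statistical package obtains from unrounded values.

```python
Z_975 = 1.96  # approximate 97.5th percentile of the standard Normal

# (coefficient, SE) as printed in Figure 14.19
table = {
    "Treatment Effect": (-1.61, 15.22),
    "Baseline (6MWT)": (0.91, 0.09),
}


def wald(coef, se):
    """Return (z, lower, upper) for a Wald-type 95% confidence interval."""
    return coef / se, coef - Z_975 * se, coef + Z_975 * se


for term, (coef, se) in table.items():
    z2, lower, upper = wald(coef, se)                   # two decimal places
    z1, _, _ = wald(round(coef, 1), round(se, 1))       # one decimal place
    print(term, round(z2, 2), round(z1, 2), round(lower, 2), round(upper, 2))
```

For the Baseline term the z value moves from about 10.1 with two-decimal inputs to 9.0 with one-decimal inputs, which illustrates why the extra decimal place is worth retaining.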

14.10

Matched organs receiving the same intervention

14.10.1 Design

In some situations, a comparative two-parallel group trial comparing S and T may be designed in which one of the interventions is given to a patient but measures of the same outcome may be made at more than one site. For example, if the trial is concerned with patients with glaucoma then the patient may have two eyes affected which may then each be treated by the same randomly allocated T intervention. Other patients will have both eyes allocated to S. In all patients recruited to the trial, both eyes might then be assessed at a single fixed time-point post-randomisation. Such a design, although concerning repeated measures on the same patient, is not longitudinal in structure. This extends to other situations such as, for example, patients presenting with multiple burns in the trial conducted by Ang, Lee, Gan, et al. (2001) of Example 2.2, those with eczema at multiple sites in the trial of Meggitt, Gray and Reynolds (2006) of Example 1.7, or restoratives for dental caries in several oral locations as considered by Lo, Luo, Fan and Wei (2001) in Example 13.13.

14.10.2 Single outcome per patient

If, for example, both eyes of a patient are affected by the condition of concern, then with the same treatment given to each eye the patient may respond to treatment in neither eye, one eye, or both eyes. Thus, the possible pairs of response to the two eyes are (0, 0), (1, 0), (0, 1) and (1, 1). In this case, information from both eyes can be combined, and the consequent outcome variable can be regarded as taking a possible level response of 0, 1 or 2. Thus, each patient (albeit determined from two sites) now contributes a single observation and the corresponding analysis of the trial would involve an ordered categorical variable with κ = 3 levels: 0, 1 and 2, as discussed in Chapter 8.

14.10.3 Several (single-time) outcomes per patient

In other situations, the information on each eye may be retained (rather than subsumed into a single measure) in which case the two responses, as they are from eyes of the same patient, will be correlated to some extent. This within-subject correlation will then need to be taken account of by, in the subsequent database, noting that these responses are provided by the same subject. In simple terms, the individual patient will define a cluster of two rows (one for each eye) within the database, one column of which contains the single eye response and a second the unique patient identifier.

Retaining the responses from each eye as distinct widens the options for the trial design to enable patients with only a single eye requiring treatment to be recruited. In that case, the final database will concern some clusters of information from two eyes and other clusters concerning a single eye. Giving each eye a distinct row in the database enables baseline features of that eye, which may be thought to be informative covariates, to be added. The subsequent analysis then has the potential to check on their possible impact on the estimated effect size and thereby improve the precision by which it is estimated.

In other circumstances, there may be many sites of concern with a variable number from patient to patient. Thus, patients with pressure sores may have them at several locations, perhaps with one or more on each heel, several on the buttocks and back, each of which is to be monitored and assessed. In which case, the clusters (here the patients) will have a wide range of pressure sores under observation.

Although each patient defines a cluster, it is the patient him or herself who is (individually) randomised to S or T. This is distinct from the situation of Chapter 16 Cluster Designs which is concerned with designs in which groups of subjects (rather than individuals) are randomised to the interventions under test. Nevertheless, in order to take care of the within-subject correlation the method of analysis for the trial designs discussed here mirrors that of Chapter 16 and so is not outlined here. To complicate the situation further, one might envisage the designs of this section, comparing more than two interventions, having a longitudinal component and/or incorporated into a cluster design of Chapter 16.
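The two database layouts described in Sections 14.10.2 and 14.10.3 can be sketched with a few hypothetical records. The patient identifiers, field names and responses below are invented for illustration and do not come from any trial.

```python
from collections import Counter, defaultdict

# Hypothetical long-format records: one row per treated eye.
rows = [
    {"patient": "A01", "arm": "T", "eye": "left", "response": 1},
    {"patient": "A01", "arm": "T", "eye": "right", "response": 1},
    {"patient": "A02", "arm": "S", "eye": "left", "response": 0},
    {"patient": "A02", "arm": "S", "eye": "right", "response": 1},
    {"patient": "A03", "arm": "T", "eye": "left", "response": 0},  # single eye
]

# Group the rows into clusters using the unique patient identifier.
clusters = defaultdict(list)
for row in rows:
    clusters[row["patient"]].append(row)

# Section 14.10.2: collapse two-eye patients to an ordered level 0, 1 or 2.
ordinal = {
    pid: sum(r["response"] for r in eyes)
    for pid, eyes in clusters.items()
    if len(eyes) == 2
}

# Section 14.10.3: keep each eye as its own row and note the cluster sizes.
cluster_sizes = Counter(len(eyes) for eyes in clusters.values())

print(ordinal)        # {'A01': 2, 'A02': 1}
print(cluster_sizes)  # Counter({2: 2, 1: 1})
```

Keeping the long format leaves room for eye-level baseline covariates as extra fields in each row, whereas the collapsed form discards them.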

CHAPTER 15

Non-Inferiority and Equivalence Trials

In this chapter, contrasts are made between trials designed to detect superiority and those designed to demonstrate non-inferiority or equivalence. Thus, the non-inferiority designs focus on whether or not a proposed alternative therapy can replace the standard therapy as, for example, it is less burdensome for a patient. In this scenario, the alternative is presumed not to be more efficacious than the standard. Nevertheless, provided the loss is not more than what is regarded as a clinically important reduction, the alternative would be recommended to replace the standard. In other circumstances, a new formulation of a specific drug may be required to deliver the same quantity of the active compound, neither too far below nor too far above that delivered by the standard. In this case, the equivalence of the alternative to the standard is required. Methods of analysis and for estimating the numbers of participants to be recruited to such trials are given. Comment on some practical issues arising and implications for reporting are included.

15.1 Introduction

So far in this book, we have based our discussions around clinical trials of different parallel-group designs which focus on establishing differences between the alternative interventions being compared. Often these compare a standard treatment or intervention with a new or alternative approach that, it is hoped, may be more effective. In general, these are termed Superiority trials. In other circumstances, an alternative therapy may be suggested to replace the standard provided its efficacy is not much less than that of the standard. Such trials are termed Non-inferiority trials and, although the basic design may appear to be the same as for a superiority trial, some issues affect the size of the trial, as well as its conduct, analysis and interpretation. In certain circumstances, Equivalence or otherwise of the alternative interventions is to be established.

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.

15.2 Non-inferiority

Implicit in a comparison between two groups in a randomised trial is the presumption that, if the null hypothesis is rejected, a difference between the groups being compared is concluded. Thus, if this involves a comparison of two treatments, one hopes to conclude that one treatment is superior to the other irrespective of the magnitude of the difference observed. In certain situations, a new therapy may bring certain advantages over the current standard for a particular condition but not necessarily be more therapeutically effective. However, if following a comparative clinical trial the new approach has proven superiority, then this, together with the advantages it brings, will be sufficient justification to replace the standard approach. Also, if the treatments to be compared are for an acute but not serious condition, then a cheaper but not so efficacious alternative may be an acceptable replacement for the standard. In general, however, the new therapy will be required to be ‘not worse than’ the standard to a predefined clinical extent.

15.2.1 Limit of non-inferiority

If a Test (T) intervention is to replace the Standard (S) in future clinical use, then T is only required to be at least ‘non-inferior’ to S with respect to efficacy. This implies that the limit of ‘non-inferiority’ is a pre-specified maximum difference between the two groups; if the difference observed after the clinical trial is conducted is smaller than this limit, T would be described as non-inferior to S.

15.2.2 Hypothesis

If T and S are to be compared using patient response as the outcome then, for a superiority trial, the null hypothesis is specified by H0: δ = πT − πS = 0 and the 2-sided alternative hypothesis by HA: δ ≠ 0. In the situation when non-inferiority is required, πT needs at the worst to be no lower than πS by a pre-determined amount, η. The null hypothesis then has only one component, that of inferiority, so that H0: δ < η. The alternative hypothesis of non-inferiority is HA: δ > η.

For the situation when T may be anticipated to be less effective than S, such as in a lower response to treatment, η may be specified to assume a negative value. In contrast, when T may be anticipated to be more adverse than S, perhaps implying that a greater disease recurrence rate may be expected, then η may be specified to assume a positive value. These situations are represented in Figure 15.1a.

15.2.3 Establishing non-inferiority

Although the relevant hypotheses have been outlined in Figure 15.1, at the analysis stage no hypothesis tests are conducted; instead, decisions with respect to concluding non-inferiority are made using confidence interval limits.


(a) Non-inferiority: The aim is not to estimate δ but to judge if it is within the margin defined by η. Requires η to be pre-specified.

T less effective than S anticipated
  Null hypothesis: δ < η (inferior)
  Alternative hypothesis: δ ≥ η (non-inferiority established)
  Confidence limit: 1-sided 100(1 − α)%, 95% LL; equivalently the 2-sided 100(1 − 2α)% interval, 90% CI
  Type I error, α: 1-sided, 0.05
  Type II error, β: 1-sided, 0.1 or 0.2

T more adverse than S anticipated
  Null hypothesis: δ > η (inferior)
  Alternative hypothesis: δ ≤ η (non-inferiority established)
  Confidence limit: 1-sided 100(1 − α)%, 95% UL; equivalently the 2-sided 100(1 − 2α)% interval, 90% CI
  Type I error, α: 1-sided, 0.05
  Type II error, β: 1-sided, 0.1 or 0.2

(b) Equivalence: The aim is not to estimate δ but to judge if it is within the margins defined by η. Requires η to be pre-specified.

  Null hypothesis: δ < −η or δ > +η (not equivalent)
  Alternative hypothesis: −η ≤ δ ≤ +η (equivalence established)
  Confidence interval: 2-sided, 95% CI
  Type I error, α: 2-sided, 0.05
  Type II error, β: usually 2-sided, 0.1 or 0.2

Figure 15.1 Hypotheses associated with parallel-group trials to establish (a) non-inferiority and (b) equivalence when comparing a test (T) with a standard (S), where δ = πT − πS

15.2.3.1 T less effective than S anticipated

In this situation, the judgement with respect to non-inferiority is made on the basis of the lower limit (LL) of the 1-sided 100(1 − α)% CI for the difference of, for example, pT − pS, given by

$$\mathrm{LL} = \text{Difference} - z_{1-\alpha}\,\mathrm{SE}(\text{Difference}) \tag{15.1}$$

This is equivalent to the lower bound of a 2-sided 100(1 − 2α)% CI. Non-inferiority for T is concluded if the calculated LL at the end of the trial is above the non-inferiority margin, η ( +η. The alternative hypothesis of equivalence is, HA: −η < δ ≤ η as in Figure 15.1b.


15.5.2 Analysis

The analysis required for assessing equivalence only entails a 2-sided 100(1 − α)% CI of the difference (however expressed) between the two therapeutic approaches concerned. As indicated in Figure 15.5, only when both ends of the CI so obtained lie within the ‘equivalence’ region can equivalence be concluded. In general, it is not the width of the CI per se which determines equivalence but its location. Thus, even a narrow CI may cross a boundary, in which case equivalence would not be established. As an intention-to-treat (ITT) analysis is a conservative strategy in that any protocol deviations will tend to dilute differences between the interventions, a per-protocol (PP) analysis, which only includes patients complying with the protocol, is often also conducted. Clearly, if both analyses indicate equivalence (or both do not), there is no difficulty with the interpretation. If they disagree, some judgement is required with respect to the conclusion drawn.

Figure 15.5 Schematic diagram to illustrate the concept of equivalence by using a series of possible comparative trial outcomes as summarised by their reported 2-sided 100(1 − α)% confidence intervals, plotted against the difference T − S on a horizontal axis marked at −η, 0 and η. Conclusions: Trial A – Equivalence confirmed; Trial B – Equivalence not established; Trial C – Equivalence not established

Example 15.11 Parallel-group design – head louse infestation

The equivalence trial of Burgess, Brown and Lee (2005) randomised 253 people with head louse infestation to treatment by Dimethicone (Di) lotion or Phenothrin (Ph, Standard) liquid, which was the most widely used pediculicide. They set the limit of equivalence to be 20%. This implies that the investigators are setting η = 0.2 and hence the composite null hypothesis as H0: δ < −0.2 or δ > +0.2. An ITT analysis of positive outcomes gave the responses as pDi = 89/127 = 0.7008 and pPh = 94/125 = 0.7520 for Di and Ph, respectively. From these, d = −0.0512 and the 95% CI ranges from −0.1611 to 0.0587 or approximately −16 to +6%. The corresponding rates for a PP analysis were pDi = 84/121 = 0.6942 and pPh = 90/116 = 0.7759, leading to a 95% CI ranging from −0.1934 to 0.0302 or approximately −19 to +3%. Since the 95% CIs for δ from both the ITT and PP analyses lie within the margin of equivalence, that is, −20% < δ < 20%, equivalence was established in both instances. This is similar to the situation of Trial A in Figure 15.5.
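The interval calculations of Example 15.11 can be checked with a few lines of code. The following is a minimal sketch, assuming the simple large-sample (Wald) interval for a difference in proportions, which appears to reproduce the quoted limits; the function names `diff_ci` and `is_equivalent` are ours and purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(r_t, n_t, r_s, n_s, alpha=0.05):
    """2-sided 100(1 - alpha)% Wald CI for the difference p_T - p_S.

    Assumption: a large-sample Normal approximation with unpooled SE.
    """
    p_t, p_s = r_t / n_t, r_s / n_s
    d = p_t - p_s
    se = sqrt(p_t * (1 - p_t) / n_t + p_s * (1 - p_s) / n_s)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return d, d - z * se, d + z * se


def is_equivalent(lower, upper, eta):
    """Equivalence is concluded only if the whole CI lies inside (-eta, +eta)."""
    return -eta < lower and upper < eta


# Counts quoted in Example 15.11 (Dimethicone versus Phenothrin).
itt = diff_ci(89, 127, 94, 125)
pp = diff_ci(84, 121, 90, 116)
for label, (d, lo, hi) in (("ITT", itt), ("PP", pp)):
    print(f"{label}: d = {d:.4f}, 95% CI {lo:.4f} to {hi:.4f}, "
          f"equivalent: {is_equivalent(lo, hi, eta=0.2)}")
```

Note that the decision depends on the location of both limits relative to ±η, not on the width of the interval.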

15.5.3 Trial Size

As indicated in Figure 15.1b, the test size, α, and the Type II error, β, are now both 2-sided and, with these properties noted, the expressions given for trial size in the earlier sections referring to non-inferiority are readily adapted to Equivalence trials.

Example 15.12 Cross-over design – type 2 diabetes

King (2009) compared, in a randomised double-blind cross-over design, once-daily insulin detemir (D) with insulin glargine (G) for providing glycaemic control in patients with type 2 diabetes. The results of this trial indicate a mean of approximately 130 mg/dL for the 24-h glucose levels, and we assume SDDiff = 20 mg/dL. Suppose a confirmatory equivalence trial of the same design is planned, with patients randomised equally to the two sequences DG and GD, with μG and μD presumed equal and with an equivalence margin set as η = 5.5 mg/dL. From Table T3 for 2-sided α = 0.05, z0.975 = 1.96 and for 2-sided β = 0.2, z0.9 = 1.2816. As the endpoint is continuous and this trial has a cross-over design, the required sample size is calculated from Equation (15.9). This gives

$$N_{\text{Pairs}} = \frac{(1.96 + 1.2816)^2}{\left[\dfrac{0 - 5.5}{20}\right]^2} = 138.94$$

or 140 sequences and implies 70 patients should be randomised to each of the DG and GD sequences.
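The arithmetic of Example 15.12 can be repeated in code. The sketch below does not claim to be Equation (15.9) itself; it simply reproduces the calculation shown in the example, with 2-sided α and 2-sided β, and assumes (our choice) that the total is rounded up to an even number so that the two sequences are of equal size. The function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def crossover_equivalence_pairs(eta, sd_diff, delta=0.0, alpha=0.05, beta=0.2):
    """Number of patients (pairs of observations) before rounding.

    Uses 2-sided alpha and 2-sided beta, as in the worked example.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_beta = z(1 - beta / 2)
    standardised = (delta - eta) / sd_diff
    return (z_alpha + z_beta) ** 2 / standardised ** 2


def round_up_to_even(n):
    """Round up to the next even integer (assumed here for equal sequences)."""
    return 2 * ceil(n / 2)


n_pairs = crossover_equivalence_pairs(eta=5.5, sd_diff=20.0)
total = round_up_to_even(n_pairs)
print(f"N_pairs = {n_pairs:.2f}, recruit {total} ({total // 2} per sequence)")
```

Because the required number depends on the square of η/SDDiff, halving the equivalence margin would roughly quadruple the number of patients required.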

15.6 Reporting

Detailed requirements for reporting the results of non-inferiority (and equivalence) trials can be found in Piaggio, Elbourne, Altman, et al. (2012) which extends the CONSORT statement appropriately. They meticulously illustrate how all the key aspects of a trial should be addressed and provide useful examples and commentary. They also highlight features of the reporting process which relate to specific design features of the trial, such as the choice of the non-inferiority limit and sample size, so this paper also serves as a checklist for planning purposes.

15.7 Practical Issues

As we discussed in Chapter 2, the application of the ITT principle to a superiority trial is a conservative procedure. Thus, ITT will tend to dilute the difference between the randomised interventions (since they become more similar whenever a participant, for example, refuses the randomised option but then receives the alternative) and thereby reduces the chance of demonstrating efficacy should it exist. However, Piaggio and Pinol (2001) point out that for non-inferiority and equivalence trials the dilution caused by ITT will not act conservatively, as in these cases any dilution will tend to favour, as appropriate, the non-inferiority or equivalence hypothesis. Thus, a PP analysis is recommended, although Jones, Jarvis, Lewis and Ebbutt (1996) suggest that an ITT as well as a PP analysis should be conducted. In certain circumstances, a randomised trial may incorporate both non-inferiority and superiority comparisons in what may be termed ‘As good as or better’ trials. Thus, in the statistical analysis section of the trial of Example 15.4, Schroeder, Werner, Meyer, et al. (2017, p. 2229) state: ‘A non-inferiority margin of 5% absolute difference was deemed a clinically nonsignificant difference. A priori, if non-inferiority was met and the lower 95% confidence interval was >0%, superiority would be claimed’. At the analysis stage, such designs imply that a test of non-inferiority is first conducted. Then, provided non-inferiority is established, a test for superiority follows. On the other hand, if non-inferiority is not established, then no subsequent test for superiority is conducted.
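The sequential logic of an ‘As good as or better’ analysis can be sketched as follows. This is an illustration only: the lower limit takes the form of Equation (15.1) with a large-sample SE for a difference in proportions, the same limit is assumed to be used for both steps (as the quoted statement seems to suggest), and the counts are hypothetical, invented purely for the example.

```python
from math import sqrt
from statistics import NormalDist


def lower_limit(r_t, n_t, r_s, n_s, alpha=0.05):
    """1-sided 100(1 - alpha)% lower limit for p_T - p_S (form of Equation 15.1)."""
    p_t, p_s = r_t / n_t, r_s / n_s
    se = sqrt(p_t * (1 - p_t) / n_t + p_s * (1 - p_s) / n_s)
    return (p_t - p_s) - NormalDist().inv_cdf(1 - alpha) * se


def as_good_as_or_better(ll, margin):
    """Sequential decision: non-inferiority first, superiority only if that holds.

    margin is the non-inferiority limit eta, negative when a lower response
    with T is the concern (for example -0.05 for a 5% absolute margin).
    """
    if ll <= margin:
        return "non-inferiority not established"
    if ll > 0:
        return "non-inferior and superior"
    return "non-inferior"


# Hypothetical counts, not taken from any trial discussed in this chapter.
ll = lower_limit(r_t=450, n_t=500, r_s=440, n_s=500)
print(round(ll, 4), as_good_as_or_better(ll, margin=-0.05))
```

The ordering of the two checks is the essential point: superiority is only examined once the lower limit has cleared the non-inferiority margin.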

15.8 Guidelines

Guidelines and tutorial papers relevant to the content of this chapter are collated here.

Flight L and Julious SA (2016). Practical guide to sample size calculations: noninferiority and equivalence trials. Pharmaceutical Statistics, 15, 80–89.

Piaggio G, Carolli G, Villar J, Pinol A, Bakketeig L, Lumbiganon P, Bergsjø P, Al-Mazrou Y, Ba’aqeel H, Belizán JM, Farnot U and Berendes H (2001). Methodological considerations on the design and analysis of an equivalence stratified cluster randomization trial. Statistics in Medicine, 20, 401–416.

Piaggio G, Elbourne DR, Pocock SJ, Evans SJW and Altman DG (2012). Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. Journal of the American Medical Association, 308, 2594–2604.

Walker E and Nowacki AS (2011). Understanding equivalence and noninferiority testing. Journal of General Internal Medicine, 26, 192–196.

CHAPTER 16

Cluster Designs

In general, we have based our discussions around parallel-group designs in which individual subjects are randomised. However, one adaptation of this basic design is to allocate the interventions at random to collections or clusters of participants rather than to individuals. Thus, the distinctive characteristic of a cluster trial is that specific groups or blocks of subjects (the clusters) are first identified, and these units are assigned at random to the interventions. The term ‘cluster’ in this context may refer to a household, school, clinic, care home or any other relevant grouping of individuals. Although the basic design structure of a parallel two-group design may be retained, issues of informed consent, trial size and analysis are somewhat distinctive. For example, when comparing the interventions in such cluster-randomised trials, account must always be taken of the particular cluster from which each data item is obtained. We also comment on some practical issues arising and implications for reporting.

16.1 Design features

In certain situations, the method of delivery of the intervention prevents it from being given on an individual participant basis; instead, the delivery is made to collections of individuals. For example, if a public health campaign conducted through the local media is to be tested, it may be possible to randomise locations, which are then termed clusters, to either receive or not receive the planned campaign. It would not be possible to randomise individuals. Thus, Fayers, Jordhøy and Kaasa (2002) comment that cluster trials are particularly relevant when evaluating interventions at the level of clinic, hospital, district or region. In a two-group cluster-randomised trial, several (usually half) of the clusters will receive one intervention and the remainder the other. Thus, a whole cluster, consisting of a number of individuals who then become the trial participants, is assigned en bloc. Nevertheless, just as for the individual-specific random allocation design, the outcome is measured on every individual. In general, cluster trials will compare g = 2 or more interventions, involving K clusters, a fraction of which will receive each of the interventions, and each cluster comprising m subjects. Design options include the choice of m and K, both of which may involve non-statistical considerations, perhaps determined by the number of clusters willing to participate and the practical limitations with respect to the number of participants recruited per cluster.

Example 16.1 Cluster design – enhanced diabetes care

Bellary, O’Hare, Raymond, et al. (2008) conducted a cluster-randomised controlled trial in which inner-city medical practices in the UK were assigned by simple randomisation to intervention or control groups with the object of improving diabetes prevention and care among a high-risk group of south Asian ethnic origin. The intervention consisted of enhanced care including additional time with the practice nurse and support from a link worker and diabetes specialist nurse, while the control was standard care. Primary outcomes were changes in blood pressure, total cholesterol and glycaemic control (haemoglobin A1c) after 2 years. The results suggested that ‘small but sustained improvements in blood pressure can be achieved’ by the use of enhanced care.

A further situation where cluster designs are useful is where there is a possibility of contamination in the delivery of the intervention itself. For example, in the trial of Bellary, O’Hare, Raymond, et al. (2008), it would not be easy to randomise half the individuals to ‘enhanced care’ and the other half to ‘standard care’ within the same medical practice. Conceptually, this could be done but it would then be very difficult for the practice team involved in the ‘enhanced care’ group to disappear or change their mode of operation while ‘standard’ care is being delivered to an individual. This difficulty raises the possibility of contamination in the way ‘standard care’ is delivered and/or received in a similar way to that of the carry-across problem of the split-mouth design (see Section 13.3.2). Any resulting contamination will almost certainly diminish the observed magnitude of any differences resulting from using the approaches under evaluation.

16.2 Procedures

16.2.1 Consent

The way in which the consent process is instituted will depend critically on the specific design of the trial concerned. For some trials, agreement to participate may only concern the cluster management representatives. Thus, if care homes for the elderly are concerned in a trial of working practices within the home with an endpoint (say) of the number of their residents that will experience falls over a given time, then consent from the residents may not be required. In such a situation, consent from the care-home workers involved may not be required either. In other cases, if care workers are to be trained to implement the intervention allocated, then their consent may be required but possibly not that of the residents concerned. However, if the endpoint involves direct assessment of the residents themselves, then consent to participate (as opposed to assent to be randomised) from each individual may be required. Further, the actual consent procedures to follow are likely to depend on the requirements of the regulatory authority under whose umbrella the trial is being conducted. Lignou (2018, Figure 1) indicates that different types of health research can be classified based on the associated risk and the degree of intrusion into personal life. She also highlights some of the associated ethical challenges and requirements for cluster trials.

Example 16.2 Facemask protection for health care workers

MacIntyre, Seale, Dung, et al. (2015) conducted a three-arm cluster-randomised trial involving health care workers (HCW) from 14 hospitals, comprising 74 wards, to assess the impact on the respective levels of clinical respiratory illness (CRI) of wearing either cloth masks (CM) or medical masks (MM) at all times on their work shift compared to a control group using standard practice (SP). In this trial, from the eligible wards, 1607 of 1868 HCWs provided informed consent. Thus, although the clusters were the 74 wards, consent was provided by the individual HCWs concerned.

Example 16.3 Exercise for depression in elderly residents in care homes

The cluster-randomised trial of Underwood, Lamb, Eldridge, et al. (2013) tested whether a moderate intensity exercise programme would reduce the burden of depressive symptoms in residents of 78 care homes. They state that: ‘Once care homes had agreed to participate, we invited residents to give written, informed consent, or if they lacked capacity to consent, for their next of kin to give written, informed agreement for us to collect data directly from participants, from carehome staff, and from care home and National Health Service (NHS) records’. The two-level consent process involved consent at the care home and consent by the participant (or next of kin).


16.2.2 Randomisation

In individual-patient randomised trials, the patients will most likely present one at a time and can be randomised accordingly. In contrast, for cluster-randomised trials, all the clusters are usually identified before the trial starts. Thus, any stratification and randomisation (within strata) can be carried out, and the clusters are then informed of their allocation before beginning patient entry. Consequently, as is the case with the consent process, the clusters are all randomised to the interventions concerned before any recruitment begins. In general, the number of clusters is often quite limited, but it is the clusters themselves that are randomised rather than the individuals within each cluster. The purpose of randomisation is to try to balance characteristics associated with the cluster although, as their number is usually small, the scope for randomisation to achieve a balance between them is limited. However, as the number of clusters increases, we would expect cluster characteristics, on average, to balance across the intervention groups, and so patient characteristics should also balance across intervention groups as a consequence. Even in a trial with a very small number of clusters, where there is little possibility of balancing patient characteristics, it is still worthwhile randomising so that complete objectivity in the intervention allocation process can be claimed. In addition, if the clusters are of variable size then, even in the situation where they are few, it is worthwhile stratifying the randomisation to the interventions by cluster size. In trials that require newly diagnosed patients, it may not be possible to specify the cluster size precisely at the planning stage. In this case, a proxy measure of cluster size, such as the size of the clinic from which the patients are drawn, can be used instead for stratification purposes.
In carrying out randomisation, there may be two steps to take in a cluster design. One may be the random selection of clusters, from a larger body of potential clusters, to include in the trial; then, once selected, the chosen clusters are assigned randomly (within strata if appropriate) to the alternative interventions. In some circumstances, there may be more clusters available than are necessary for the purposes of the trial. For example, when randomising medical practices in rural areas of which there are 100, considerations of the total trial size might stipulate that 30 clusters would be sufficient. The team would then have to select practices for the trial, and this is best done at random. Thus, they choose K = 30 at random from the 100. In principle, this can be done by first numbering the practices in any order from 01 to 100 but use 00 to represent practice 100. Then, using the first two digits in (say) the first column of Table T1, we find the first 30 numbers in the range 00–99 are successively 75, 80, 94, 67, …, 87, 63. However, 03, 43, 50, 67, 90 and 94 are repeated in this list, and so the next six random number pairs are taken. These are 72, 69, 64, 31, 39 and 57 but 57 has been used previously so the next is 50. This too has to be ignored and so the next, which is 48, is taken. Now that the numbered list of 30 is complete, the corresponding medical practices are then identified from the list; these are then the clusters to be involved. The process of randomisation is usually achieved with a suitable computer program and good practice now requires that the selection from the list must be reproducible (should it become necessary for regulatory purposes), and the full process documented.


Once the clusters are identified, then the interventions are allocated to these clusters using the methods of Chapter 5. As we have stated, in this situation, randomisation is a ‘one-off’ allocation as the actual participants within each cluster are not now individually randomised.
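The two steps described above, random selection of clusters followed by random allocation within strata, can be sketched in code. This is a minimal illustration, assuming Python's standard `random` module with a recorded seed to make the selection reproducible; the simple shuffle stands in for the methods of Chapter 5, and the seed value, the function names and the split into ‘small’ and ‘large’ strata are all hypothetical.

```python
import random


def select_clusters(n_available, k, seed):
    """Choose k distinct cluster numbers from 1..n_available, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_available + 1), k))


def allocate_within_strata(strata, arms, seed):
    """Randomly allocate clusters to arms, as evenly as possible per stratum."""
    rng = random.Random(seed)
    allocation = {}
    for clusters in strata.values():
        shuffled = list(clusters)
        rng.shuffle(shuffled)
        offset = rng.randrange(len(arms))  # which arm receives any extra cluster
        for position, cluster in enumerate(shuffled):
            allocation[cluster] = arms[(position + offset) % len(arms)]
    return allocation


SEED = 20210  # hypothetical; the seed used should be documented with the trial
selected = select_clusters(n_available=100, k=30, seed=SEED)

# Hypothetical strata: in practice these would be defined by practice size.
strata = {"small": selected[:15], "large": selected[15:]}
allocation = allocate_within_strata(strata, arms=("S", "T"), seed=SEED + 1)
print(selected)
print(allocation)
```

Re-running the code with the same seed regenerates the same list and the same allocation, which is what makes the process reproducible should this be required for regulatory purposes.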

Example 16.4 Number of clusters

In Example 1.11, we described the trial concerned with the use of hip protectors in elderly residents in nursing homes conducted by Meyer, Warnke, Bender and Mülhauser (2003). Their trial included a large number of clusters, the nursing homes, 25 of which, comprising a total of 459 residents, were assigned to the intervention group and 24 homes, with 483 residents, to the control. In contrast, the trial of Bellary, O’Hare, Raymond et al. (2008) of Example 16.1 involved fewer than half this number of clusters, with 21 inner-city medical practices included and an unbalanced randomisation of the clusters which assigned 9 practices to the intervention, comprising 868 patients, and 12 practices to control, comprising 618 patients: 250 patients fewer than for the intervention group. In this case, the total number of patients of 1486 from fewer clusters exceeded that of the nursing home trial with 942 residents. A better design might have been first to stratify the practices into two groups of small and large practices, then among these allocate to intervention or control on a 1 : 1 basis as closely as possible. This would certainly balance the numbers of patients within the intervention and control groups more evenly and perhaps balance more closely practice characteristics and hence also patient characteristics.

16.3 Regression models

When the randomised allocation applies to the clusters, the basic principles for analysis still apply although modifications are required. To illustrate these, we describe the t-test for comparing two means from a non-cluster design comparing Standard (S) and Test (T) interventions using the linear regression terminology of Equation (8.13) with intercept β0 and slope βTreat, that is,

$$y_i = \beta_0 + \beta_{\text{Treat}}\, x_i + \varepsilon_i, \tag{16.1}$$

where the subjects concerned are i = 1, 2, …, N; xi = 0 for S and xi = 1 for T. Further, εi is a random variable with mean zero and, within each intervention group, assumed to have the same variance, σ². If this regression model is fitted to the data then bTreat = ȳT − ȳS estimates βTreat and the null hypothesis βTreat = 0 is tested using the t-test. However, if this was instead a cluster design, the analysis must now take account of the cluster to which an individual subject belongs. When we compare two


interventions, N (= nS + nT) patients are recruited and these will come from different clusters each of size m. Therefore, there are correspondingly kS = nS/m and kT = nT/m clusters assigned to the respective interventions. The total number of clusters in the trial is K = kS + kT. To allow for the clusters, the fixed-effect model (16.1) is extended to the mixed-effect model:

$$y_{ij} = \beta_0 + \beta_{\text{Treat}}\, x_{ij} + \gamma^{*}_{j} + \varepsilon_{ij} \tag{16.2}$$

Here, the clusters are j = 1, 2, …, K and the subjects i = 1, 2, …, m in each cluster. The coefficient βTreat is a fixed effect, while the cluster effects, γ*j, are random effects which are taken to vary between clusters about a mean of zero, with variance σ²Between-Cluster. Further, the εij are also assumed to have mean zero but with variance σ²Within-Cluster. Both random variables, γ*j and εij, are assumed to be Normally distributed. The variance of yij of Equation (16.2), denoted Var(yij) = [SD(yij)]², is the combined total of the within- and between-cluster variances, that is,

$$\operatorname{Var}(y_{ij}) = \sigma^2_{\text{Total}} = \sigma^2_{\text{Between-Cluster}} + \sigma^2_{\text{Within-Cluster}} \tag{16.3}$$

Once again, the null hypothesis, βTreat = 0, is tested using the t-test but now the SE concerned in its calculation will differ from the previous value as the variation introduced by the K clusters influences its value.

16.4 Intra-class correlation

Despite the lack of individualised randomisation and the receipt of a more group-based intervention, the assessment of the relative effect of the interventions is made at the level of the individual participant receiving the respective interventions. As a consequence, the observations on the patients within a single cluster are positively correlated as they are not completely independent of each other. Thus, patients treated by one health care professional team will tend to be more similar among themselves with respect to the outcome measure concerned than those treated by a different health care team. So, if we know which team is involved with a particular patient, we can predict to some extent the outcome for that patient by reference to experience with similar patients treated by the same team. The strength of this dependence among observations is measured by the intra-cluster correlation (ICC), or ρ. With each subject, in every cluster, providing an outcome measure, the ICC is the proportion of the total variance σ²Total accounted for by the between-cluster variation, that is:

$$\rho = \frac{\sigma^2_{\text{Between-Cluster}}}{\sigma^2_{\text{Total}}} = \frac{\sigma^2_{\text{Between-Cluster}}}{\sigma^2_{\text{Between-Cluster}} + \sigma^2_{\text{Within-Cluster}}} \tag{16.4}$$
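To make Equation (16.4) concrete, the following sketch estimates ρ from clustered data using a one-way analysis of variance. This is one common estimator, chosen here for illustration rather than taken from the text; it assumes equal cluster sizes and truncates negative estimates at zero. The function name `anova_icc` is ours.

```python
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equal-sized clusters.

    clusters is a list of K lists, each holding the m outcomes of one cluster.
    Negative estimates are truncated at zero.
    """
    k = len(clusters)
    m = len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("this sketch assumes equal cluster sizes")
    grand = mean(y for c in clusters for y in c)
    cluster_means = [mean(c) for c in clusters]
    # Mean squares between and within clusters.
    ms_between = m * sum((cm - grand) ** 2 for cm in cluster_means) / (k - 1)
    ms_within = sum((y - cm) ** 2
                    for c, cm in zip(clusters, cluster_means)
                    for y in c) / (k * (m - 1))
    # Variance components corresponding to Equation (16.3).
    var_between = max(0.0, (ms_between - ms_within) / m)
    total = var_between + ms_within
    return var_between / total if total > 0 else 0.0


# Two small clusters whose means differ markedly: most variation is between clusters.
print(anova_icc([[0, 2], [4, 6]]))
```

When all the variation lies between clusters the estimate is 1, and when the cluster means coincide it is 0, matching the interpretation of ρ as the between-cluster share of the total variance.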


Although calculated in a different way, ρ is interpreted in a similar way to a Pearson correlation coefficient. However, since variances cannot be negative, the ICC cannot be negative. In general, the more heterogeneity between the clusters, the greater σ²Between-Cluster becomes, and this then inflates the value of ρ. A major challenge in planning the sample size for a cluster design trial is identifying an appropriate value for ρ. In practice, estimates of ρ are usually obtained from previously reported trials using similar randomisation units (cluster types) and outcome measures. Bellary, O’Hare, Raymond, et al. (2008) report ICCs in a medical practice setting for systolic blood pressure, total cholesterol and haemoglobin A1c as 0.004, 0.05 and 0.05, respectively. The values arising from community intervention trials are typically 1, is given by

$$DE = 1 + (m - 1)\rho \tag{16.5}$$

As DE > 1, unless ρ = 0, a cluster design of the same specification as an individually randomised trial will require more subjects. From this, the total number of subjects


required in both intervention groups, comprising a total of K clusters of a pre-specified size of m subjects per cluster, is given by

$$N = N_{\text{Initial}} \times DE, \tag{16.6}$$

where NInitial is the sample size calculated for an individually randomised design with the same design parameters as the intended cluster design. We describe below the implications of cluster randomisation with examples from trials concerned with continuous and binary outcomes.

16.5.2 Continuous endpoint

When a continuous endpoint is defined, the initial sample size, NInitial, will be obtained using Equation (9.5).

Example 16.5 Trial size – systolic blood pressure levels

Bellary, O’Hare, Raymond, et al. (2008) give the intra-primary care practice correlation for systolic blood pressure (SBP) as 0.035. Suppose an investigator wishes to repeat this trial using the same design criteria but involving a more intensive intervention (Enhanced, E) care package against the current standard (Standard, S) in medical practices, each able to recruit m = 50 patients, with the reduction in SBP as the main focus of the intervention. Although the previous trial had anticipated an effect size of 7 mmHg (SD 21.25), it was felt, bearing in mind the previous results, that a more realistic but still worthwhile effect size would be δPlan = 5 mmHg. This is equivalent to a standardised effect size of ΔPlan = 5/21.25 = 0.24 and represents a small effect using the Cohen (1988) criterion of Figure 9.8. Assuming a 2-sided test size α = 0.05, power 1 − β = 0.8 and a 1 : 1 allocation, Equation (9.5) gives NInitial = 546.98. Further, with the pre-specified ICC of ρPlan = 0.035, Equation (16.5) gives DE = 1 + (49 × 0.035) = 2.715. Hence, N = 546.98 × 2.715 = 1485.05 or 1486. If indeed the number per practice is set at m = 50, then this implies the number of clusters required is K = 1486/50 = 29.72 or approximately 30.
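The calculation in Example 16.5 follows directly from Equations (16.5) and (16.6) and can be sketched as below. NInitial is taken as given from the example rather than recomputed, and the rounding of N up to the next even integer (so that a 1 : 1 allocation is possible) is our assumption; the function names are illustrative.

```python
from math import ceil


def design_effect(m, rho):
    """Equation (16.5): DE = 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho


def round_up_to_even(n):
    """Round up to the next even integer (assumed here for 1 : 1 allocation)."""
    return 2 * ceil(n / 2)


def cluster_trial_size(n_initial, m, rho):
    """Return DE, the unrounded N of Equation (16.6), rounded N and K = N / m."""
    de = design_effect(m, rho)
    n_total = n_initial * de
    n_rounded = round_up_to_even(n_total)
    return de, n_total, n_rounded, n_rounded / m


# Values quoted in Example 16.5.
de, n_total, n_rounded, k = cluster_trial_size(n_initial=546.98, m=50, rho=0.035)
print(f"DE = {de:.3f}, N = {n_total:.2f} -> {n_rounded}, K = {k:.2f} -> {ceil(k)}")
```

Even a small ICC has a large effect when clusters are large: here ρ = 0.035 with m = 50 multiplies the required number of patients by more than 2.7.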

Example 16.6 Trial size – systolic blood pressure levels In the planned confirmatory trial of the same design as Bellary, O’Hare, Raymond, et al. (2008), the initial estimates with m = 50 per cluster and N = 1486 requires K = 30 clusters. The investigators find this number of clusters is not feasible so consider the possibility of increasing m in order to reduce K.

16.5 TRIAL SIZE


Example 16.6 (Continued)

With KNew = 20 in mind, a value of mNew = 1486/20 ≈ 75 might be considered, so that DENew = 1 + (74 × 0.035) = 3.590. This is larger than the previous value and leads to a revised total sample size of 3.590 × 546.98 = 1963.66. However, this implies that the actual (rather than hoped for) number of clusters required is KRevised = 1964/75 = 26.19, which still exceeds the target number of 20. One way to decrease the number of clusters dramatically is to increase the planning requirement ΔCohen. For example, retaining m = 50 but raising ΔCohen from 0.24 to 0.30 gives NInitial = 350.76. Combining this with DE = 2.715 of Example 16.5, we have N = 954 and K = 954/50 = 19.08 or 20. Clearly, the investigators then have to judge whether or not this revised target is feasible and meets the trial objectives.
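The trade-off explored in Example 16.6 can be tabulated directly. The sketch below, in Python, takes NInitial = 546.98 and ρ = 0.035 from Example 16.5 and shows how the number of clusters changes as m rises; the candidate values of m and the helper name are illustrative assumptions, and K is rounded up to a whole cluster.

```python
import math

N_INITIAL = 546.98   # from Example 16.5
RHO = 0.035          # planning ICC


def clusters_needed(m, n_initial=N_INITIAL, rho=RHO):
    """Return (DE, total N, number of clusters K) for cluster size m."""
    de = 1 + (m - 1) * rho            # Equation (16.5)
    n_total = n_initial * de
    return de, n_total, math.ceil(n_total / m)


for m in (50, 75, 100, 200):          # illustrative cluster sizes
    de, n_total, k = clusters_needed(m)
    print(f"m = {m:3d}: DE = {de:.3f}, N = {n_total:8.2f}, K = {k}")

# K = N_initial * [1 + (m - 1) * rho] / m tends to N_initial * rho as m grows
k_limit = N_INITIAL * RHO
print(f"K cannot fall below {k_limit:.2f}")
```

Because K approaches NInitial × ρ as m increases, enlarging the clusters alone cannot bring the number of clusters much below 20 under these planning values, which is why the example turns to a larger planning effect size instead.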

16.5.3 Binary endpoint

For a binary outcome, the dependent variable yij of Equation (16.2) only takes the values 0 or 1, with the probability πj that yij = 1 assumed constant for each subject within a cluster. Such data are analysed using a mixed-effect logistic regression model in which, because of the clusters, γ∗ is retained and is assumed to come from a Normal distribution, while ε is assumed to come from a Binomial distribution. In order to calculate a sample size, the anticipated proportions responding in each intervention group, πS and πT, need to be specified, from which δPlan = πT − πS. It is usual to assume that

\[ \sigma_{\text{Plan}} = \sqrt{\bar{\pi}\,(1 - \bar{\pi})}, \qquad (16.7) \]

where \( \bar{\pi} = (\pi_S + \pi_T)/2 \). Also required is the ICC, ρBinary, for use in the expression for DE of Equation (16.5). This can be obtained from

\[ \rho_{\text{Binary}} = \frac{\sigma^2_{\text{Between-Cluster}}}{\sigma^2_{\text{Total}}} = \frac{\sigma^2_{\text{Between-Cluster}}}{\bar{\pi}\,(1 - \bar{\pi})}. \qquad (16.8) \]

Finally, the sample size for this situation is calculated from Equation (9.6) assuming φ = 1.

Example 16.7 Comparing proportions – hypertension and hypercholesterolemia

The STITCH2 randomised cluster design trial of Dresser, Nelson, Mahon, et al. (2013), involving 35 primary care practices (PCP), compared guidelines (G) for intervention with the same guidelines plus an initial use of a single-pill combination (SP-G) to improve the management of participants with both hypertension and dyslipidaemia.


16 CLUSTER DESIGNS

Example 16.7 (Continued)

Suppose a repeat trial is planned with m = 50 subjects from each PCP, with planning values for the proportion achieving target of 0.40 with G and 0.52 with SP-G. From these, δPlan = 0.52 − 0.40 = 0.12, \( \bar{\pi} = (0.52 + 0.40)/2 = 0.46 \), and Equation (9.6) with 2-sided α = 0.05, β = 0.2 and φ = 1 gives

\[ N = \frac{1 + 1}{1} \times \frac{\left\{ 1.96\sqrt{(1 + 1) \times 0.46 \times (1 - 0.46)} + 0.8416\sqrt{1 \times 0.4(1 - 0.4) + 0.52(1 - 0.52)} \right\}^2}{(0.52 - 0.4)^2} = 539.21. \]

Further assuming ρBinary = 0.0706 (see Example 16.9 below) and using Equation (16.5), DE = 1 + (49 × 0.0706) = 4.4594. Combining these, we have N = 4.4594 × 539.21 = 2404.55. An alternative approach is to calculate σPlan = √[0.46(1 − 0.46)] = 0.4984 and ΔCohen = δPlan/σPlan = 0.2408, and then use the continuous endpoint expression (9.5) to obtain NInitial = 543.49. Combining this with DE = 4.459 gives the size of the cluster trial as N = 2423.66, which is a little larger than that from the binary calculation. However, both approaches suggest approximately N = 2500 subjects in total. The corresponding number of clusters is K = 2500/50 = 50. This allows a 1 : 1 randomisation with 25 clusters comprising 1250 participants in each of the G and SP-G interventions.
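Both routes to the sample size in Example 16.7 can be verified numerically. The Python sketch below assumes that Equations (9.6) and (9.5), with φ = 1, take the forms coded in the two functions; these forms are inferred from the worked numbers above rather than quoted from Chapter 9, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.975)   # 2-sided alpha = 0.05
Z_BETA = NormalDist().inv_cdf(0.80)     # power 1 - beta = 0.8


def n_binary(pi_s, pi_t):
    """Assumed form of Equation (9.6) for 1 : 1 allocation (phi = 1)."""
    pi_bar = (pi_s + pi_t) / 2
    root_null = sqrt(2 * pi_bar * (1 - pi_bar))
    root_alt = sqrt(pi_s * (1 - pi_s) + pi_t * (1 - pi_t))
    return 2 * (Z_ALPHA * root_null + Z_BETA * root_alt) ** 2 / (pi_t - pi_s) ** 2


def n_continuous(delta):
    """Assumed form of Equation (9.5) for 1 : 1 allocation (phi = 1)."""
    return 4 * (Z_ALPHA + Z_BETA) ** 2 / delta ** 2 + Z_ALPHA ** 2 / 2


pi_s, pi_t, m, rho = 0.40, 0.52, 50, 0.0706
de = 1 + (m - 1) * rho                             # Equation (16.5)

n_bin = n_binary(pi_s, pi_t)                       # binary route
sigma_plan = sqrt(0.46 * (1 - 0.46))               # Equation (16.7)
n_alt = n_continuous((pi_t - pi_s) / sigma_plan)   # continuous route

print(f"Binary: N_initial = {n_bin:.2f}, N = {de * n_bin:.2f}")
print(f"Continuous: N_initial = {n_alt:.2f}, N = {de * n_alt:.2f}")
```

The two routes differ by less than 1% here, consistent with the text's conclusion that either suggests roughly 2500 subjects.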

Example 16.8 Facemask protection for health care workers

As outlined in Example 16.2, MacIntyre, Seale, Dung, et al. (2015) compared the use of MM and CM with a control of SP, which may or may not include mask use, with respect to the rates of CRI in HCWs from emergency, infectious/respiratory disease, intensive care and paediatric wards, all high-risk settings for occupational exposure to respiratory infections, in hospitals in Hanoi, Vietnam. The HCWs were followed for 4 weeks of MM, CM or SP use and then for one additional week for the appearance of any CRI symptoms. For planning purposes, the investigators assumed an infection rate of 13% with CM compared to 6% with MM. Thus, δPlan = 0.06 − 0.13 = −0.07 and π̄ = (0.06 + 0.13)/2 = 0.095. Further, Equation (9.6) with the stated 2-sided α = 0.05, β = 0.2 and φ = 1 gives N = 548.50. They used an ICC of ρ = 0.027, which was obtained from an earlier study, and anticipated recruiting m = 25 HCWs from each ward. Hence, Equation (16.5) gives DE = 1 + (25 − 1) × 0.027 = 1.648. The sample size for comparing the g = 2 interventions CM and MM is then N2 = 1.648 × 548.50 = 903.93. Taking no specific account of the anticipated infection rate with SP, the trial sample size might be increased for g = 3 to N3 = (3/2) × 903.93 = 1355.90. The investigators then rounded this number and ‘aimed to recruit a sample size of 1600 participants from 15 hospitals’. This implies, with m = 25, that K = 1600/25 = 64 wards (the clusters) from these hospitals would be required. In the event, 1607 HCWs from 14 hospitals comprising 74 wards consented to participate in the trial, with an average of 22 HCWs per ward. The infection rates observed were as follows: MM 4.83% (28/580), CM 7.56% (43/569) and SP 6.99% (32/458). As a consequence, the authors ‘caution against the use of cloth masks’.
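The chain of calculations in Example 16.8 is short enough to script. The following Python sketch assumes the same form of Equation (9.6) for 1 : 1 allocation as is implied by the worked numbers in this section; variable names are illustrative and the observed counts are those quoted above.

```python
from math import sqrt
from statistics import NormalDist

z_a = NormalDist().inv_cdf(0.975)    # 2-sided alpha = 0.05
z_b = NormalDist().inv_cdf(0.80)     # power 1 - beta = 0.8

pi_cm, pi_mm = 0.13, 0.06            # planning infection rates
pi_bar = (pi_cm + pi_mm) / 2

# Assumed form of Equation (9.6) with phi = 1
numerator = (z_a * sqrt(2 * pi_bar * (1 - pi_bar))
             + z_b * sqrt(pi_cm * (1 - pi_cm) + pi_mm * (1 - pi_mm))) ** 2
n_two_arm = 2 * numerator / (pi_mm - pi_cm) ** 2

de = 1 + (25 - 1) * 0.027            # Equation (16.5) with m = 25, rho = 0.027
n2 = de * n_two_arm                  # CM versus MM
n3 = 1.5 * n2                        # scaled by 3/2 to add the SP arm
wards = 1600 / 25                    # wards implied by the recruitment target

# Observed infections / participants by arm
observed = {"MM": (28, 580), "CM": (43, 569), "SP": (32, 458)}
rates = {arm: 100 * x / n for arm, (x, n) in observed.items()}

print(f"N = {n_two_arm:.2f}, DE = {de:.3f}, N2 = {n2:.2f}, N3 = {n3:.2f}")
print(f"Wards needed: {wards:.0f}; observed rates (%): {rates}")
```

The recruitment target of 1600 is comfortably above N3, which leaves some margin for smaller than anticipated wards.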

16.5.4 Varying cluster size

In practice, there may be substantial variation in m from cluster to cluster. To allow for such variation, Rutterford, Copas and Eldridge (2015, Equation (17)) suggest that the DE of Equation (16.5) is modified to

\[ DE = 1 + \left\{ \left[ CV(m)^2 + 1 \right] \bar{m} - 1 \right\} \rho, \qquad (16.9) \]

where m̄ is the anticipated mean cluster size and CV(m) = SD(m)/m̄ is the coefficient of variation of cluster size. In practice, CV(m) is usually less than 0.7, hence [CV(m)² + 1] < 1.49, but this may nevertheless have a considerable impact on the size of DE and hence on the sample size N.

Example 16.9 Varying cluster size

To illustrate the impact of varying cluster size on the DE, we use the results from the STITCH2 trial, which include the precise number of clusters within each intervention and the number of subjects recruited per cluster. Figure 16.1 shows that cluster size varied considerably from 2 to 47.

Intervention     K(m)   Min(m)     m̄     Max(m)   SD(m)   CV(m)
G                 20       2     28.75     47     15.59   0.542
SP-G              15       2     23.33     45     14.83   0.636
All clusters      35       2     26.43     47     15.29   0.578

Figure 16.1 Number of clusters and the corresponding CV(m) of cluster size by the intervention of the STITCH2 trial. Source: Data from Dresser, Nelson, Mahon, et al. (2013).


For both interventions combined, m̄ = 26.43, CV(m) = 15.29/26.43 = 0.578 and consequently, from Equation (16.9), DE = 1 + [(0.578² + 1) × 26.43 − 1]ρ = 1 + 34.26ρ. Thus, for values of ρ equal to 0.027, 0.035 and 0.0706, obtained from previous examples in this chapter, the corresponding DE are 1.93, 2.20 and 3.42. If there is no variation in cluster size then CV(m) = 0 and, assuming m = 26.43 (although in practice an integer is required), the DE are reduced to 1.69, 1.89 and 2.80. This implies, in these examples, that the variation in cluster size increases the DE, and hence the total sample size required, by 14, 16 and 22%, respectively. Thus, in the situation of Example 16.7, with the mean cluster size anticipated to be m̄ = 50 and ρ = 0.0706 but now with CV(m) = 0.578, DE = 4.4594 increases to DE = 1 + [(0.578² + 1) × 50 − 1] × 0.0706 = 5.6387. Hence, N = 2404.55 increases to N = 5.6387 × 539.21 = 3040.45 or approximately 3000.
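The effect of unequal cluster sizes in Example 16.9 can be reproduced with a few lines of Python. This is a sketch of Equation (16.9) only; the function name is illustrative, and setting CV(m) = 0 recovers Equation (16.5).

```python
def design_effect_varying(m_bar, cv, rho):
    """Equation (16.9): DE = 1 + [(cv**2 + 1) * m_bar - 1] * rho."""
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * rho


m_bar, cv = 26.43, 0.578                 # all clusters combined, Figure 16.1

for rho in (0.027, 0.035, 0.0706):
    de_var = design_effect_varying(m_bar, cv, rho)
    de_fixed = design_effect_varying(m_bar, 0.0, rho)   # no variation in m
    increase = 100 * (de_var / de_fixed - 1)
    print(f"rho = {rho}: DE = {de_var:.2f} versus {de_fixed:.2f} "
          f"(+{increase:.0f}%)")

# Example 16.7 revisited with mean cluster size 50 and CV(m) = 0.578
de_50 = design_effect_varying(50, 0.578, 0.0706)
n_total = de_50 * 539.21
print(f"DE = {de_50:.4f}, N = {n_total:.2f}")
```

The percentage increases are computed here from the unrounded DE values; they round to the 14, 16 and 22% quoted in the text.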

16.6 Analysis

Although the measures taken in a cluster-randomised trial are on the individuals, a straightforward comparison of those receiving the intervention against those who do not is no longer possible as knowledge of the particular cluster an observation comes from has to be accounted for. As we noted in the model for a cluster design comparing two interventions, Equation (16.2) takes due note of the different clusters involved.

Example 16.10 Systolic blood pressure

We illustrate the method of analysis using an artificial example. Suppose there are two interventions, and these have been randomised to five clusters which are individual General medical Practices (GP), two receiving the T intervention and three receiving S. Further, the number of patients eventually recruited to the trial differed markedly from practice to practice. The endpoint measure of concern was systolic blood pressure (SBP), and the corresponding number of participants, mean and SD from each cluster are summarised in Figure 16.2. The difference between the means from each intervention suggests that a lower SBP, by approximately 11 mmHg, is obtained among those randomised to the T group.


Commands
table GP Intervention, contents(n SBP mean SBP sd SBP) row

Edited output
----------------------------------------------------------------
          |                    Intervention
       GP |       Standard                      Test
----------+-----------------------------------------------------
        A |    8   139.22 (16.29)
        B |   15   141.62 (21.59)
        C |   27   139.52 (19.37)
        D |                            36   128.96 (20.84)
        E |                            14   129.83 (27.82)
----------+-----------------------------------------------------
    Total |   50   140.10 (19.27)      50   129.20 (22.71)
----------+-----------------------------------------------------
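As a quick check on the summary table, the Total row can be recovered as the size-weighted average of the practice means. The Python sketch below uses only the tabulated n and mean for each practice; the helper name is illustrative.

```python
# (n, mean SBP) for each practice, taken from the edited output above
standard = {"A": (8, 139.22), "B": (15, 141.62), "C": (27, 139.52)}
test = {"D": (36, 128.96), "E": (14, 129.83)}


def pooled_mean(clusters):
    """Size-weighted mean of the cluster means."""
    total_n = sum(n for n, _ in clusters.values())
    return sum(n * mean for n, mean in clusters.values()) / total_n


mean_s = pooled_mean(standard)
mean_t = pooled_mean(test)
print(f"S: {mean_s:.2f}, T: {mean_t:.2f}, difference: {mean_s - mean_t:.2f}")
```

This crude difference treats all 100 patients as independent; it gives the point estimate of about 11 mmHg but says nothing about its precision once clustering is taken into account.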

Command
mixed SBP Intervention, || GP:, vce(cluster GP)

Edited output
Number of obs = 100, Group variable: GP, Number of groups = 5
Obs per group: min = 8, avg = 20.0, max = 36
(SE adjusted for 5 clusters in GP)
--------------------------------------------------------------------
              |             Robust
          SBP |     Coef      SE         z    p-value   (95% CI)
--------------+-----------------------------------------------------
         cons |  129.204    0.276
 Intervention |   10.897    0.696    15.66

S if, for example, each step is separated by two periods of time.]
Total number of periods, cross-sections or blocks (P = b + T = 1 + 4 = 5)
Number of cells - including any pre-roll-out (C = K × P)
Number of participants per cell (m) – however this may vary
Total number of participants (N) – however this may vary in each period depending on the design

Figure 17.1 (a) Design characteristics and terminology of a SWD, where shaded areas indicate intervention (YesI) applied and unshaded areas indicate control (NoI) (b) supplementary notation. Source: Based on Copas, Lewis, Thompson, et al. (2015).


17 STEPPED WEDGE DESIGNS

[Figure 17.2 schematic: two panels, ‘Cross sectional design’ and ‘Cohort design’, each showing groups of clusters 1 to 4 (rows) against time periods T1 to T5 (columns), with cells marked as Control or Intervention.]

Each time period represents a data collection point. In a cross-sectional design, data are collected from different samples of individuals, whilst in a cohort design, observations are made on the same individuals over time.

Figure 17.2 Schematic representations of cross-sectional and closed cohort SWD designs. Source: Martin, Taljaard, Girling, et al. (2016).

Example 17.1 Trial design – ischaemic stroke

A randomised trial conducted by Haesebaert, Nighoghossian, Mercier, et al. (2018) evaluated Targeted Training (TT) of hospital Emergency Unit (EU) professionals with the aim of improving access to thrombolysis for patients with ischaemic stroke. Amongst other endpoints, the cluster-specific thrombolysis rates amongst ischaemic stroke patients over successive periods of time were recorded. This trial typifies the kind of situation where the use of a SWD would be very appropriate. In this case, TT is given on only one occasion to each EU and is not repeated before any subsequent period. This CS option for the SWD design, as shown in Figure 17.3a, concerned 18 EUs grouped into 4 sets of clusters, with each cluster allocated to 1 of the 4 randomised arms.

17.2 NOTATION

Example 17.1 (Continued)

(a) Design

Arm    Sequence of assessments (TT marks delivery of the targeted training)
G1     t0   TT   t1   t2   t3   t4
G2     t0   t1   TT   t2   t3   t4
G3     t0   t1   t2   TT   t3   t4
G4     t0   t1   t2   t3   TT   t4

Period 1 (t0): Pre-roll-out
Period 2 (t1): Intervention G1 (Step 1)
Period 3 (t2): Intervention G2 (Step 2)
Period 4 (t3): Intervention G3 (Step 3)
Period 5 (t4): Intervention G4 (Step 4)

(b) Results (number thrombolysed/number of patients, for the Control and Intervention periods of each EU; EUs numbered as in Figure 17.7a)

Arm          EU    Control    Intervention
G1 k1 = 5     1     1/10         5/23
              2     3/20        16/58
              3     2/15         6/29
              4     0/4         21/34
              5     7/15        12/21
G2 k2 = 4     6    20/71        25/69
              7     0/12         3/8
              8     1/8          0/6
              9     0/7          1/24
G3 k3 = 5    10    33/67        18/34
             11     0/2          0/1
             12     1/13         3/15
             13     2/15         2/10
             14     0/10          -
G4 k4 = 4    15     3/10         3/7
             16    11/34         9/15
             17     0/6          0/4
             18     0/9          0/5

Figure 17.3 (a) Design: SWD with a run-in assessment of all clusters, followed by a phased implementation of targeted training (TT) sessions for emergency unit personnel, (b) results: proportion of ischaemic stroke patients receiving thrombolysis therapy before and after TT. Source: Based on Haesebaert, Nighoghossian, Mercier, et al. (2018).


The design characteristics of the SWDs used as examples in this chapter are summarised in Figure 17.4.

                                                    (a) Haesebaert,         (b) Poldervaart,       (c) Mhurchu,               (d) Jordan,
                                                    et al (2018)            et al (2013)           et al (2013)               et al (2015)
Type of SWD design                                  Full Cross-sectional    Full Cross-sectional   Incomplete Closed Cohort   Full Closed Cohort
Interventions allocated at each step    g           2                       2                      2                          2
Planned number of Clusters              K           18 Emergency Units      10 Clinics             16 Schools                 5 Care Homes
Clusters per arm                        k           Two of 4 and two of 5   1                      3 or 4 schools per term    1
Arms                                    A           4                       10                     4                          5
Pre-roll-out periods                    b           1                       1 of 1 month           0                          1
Times post-randomisation                T           4                       10                     4                          5
Steps                                   S           4                       10 of 1 month          4 school terms             5
Periods                                 P = b + T   5                       11                     4                          6
Cells                                   K × P       18 × 5 = 90             10 × 11 = 110          16 × 4 = 64                5 × 6 = 30
Participants Anticipated                            900                     6,600                  400                        50
Clusters actually involved                          18                      9                      14                         5
Participants achieved                               691                     3,648                  424                        43
Mean subjects per cross-section cell, m             7.7                     36.8                   -                          -
Mean subjects per closed cluster, n                 -                       -                      30.3                       8.3

Figure 17.4 Planning design characteristics and eventual sample size of the SWDs used as examples in this chapter
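The derived rows of Figure 17.4 follow from the notation P = b + T and C = K × P. A minimal Python sketch, using the planned values from the figure and an illustrative class name, makes the bookkeeping explicit.

```python
from dataclasses import dataclass


@dataclass
class SteppedWedge:
    name: str
    clusters: int      # K, planned number of clusters
    pre_rollout: int   # b, pre-roll-out periods
    times: int         # T, times post-randomisation

    @property
    def periods(self):
        """P = b + T."""
        return self.pre_rollout + self.times

    @property
    def cells(self):
        """C = K x P."""
        return self.clusters * self.periods


designs = [
    SteppedWedge("Haesebaert, et al (2018)", 18, 1, 4),
    SteppedWedge("Poldervaart, et al (2013)", 10, 1, 10),
    SteppedWedge("Mhurchu, et al (2013)", 16, 0, 4),
    SteppedWedge("Jordan, et al (2015)", 5, 1, 5),
]

for d in designs:
    print(f"{d.name}: P = {d.periods}, C = {d.cells}")
```

Dividing the participants achieved by the number of cells actually used gives the mean cell sizes in the figure, for example 691/90 for the first trial.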

17.3 Basic structure

The basic structure of a SWD design can be represented by a design matrix D with A rows and P columns. In broad terms, the rows represent what happens to each cluster over the whole period of the trial, whilst the columns indicate whether there is a roll-out and, after that, which intervention each cluster receives, with NoI coded ‘0’ and YesI coded ‘1’. If, in the chosen design, K > A then there will be K rather than A rows in D.

17.3.1 Full design

A SWD that includes a run-in of b (≥1) blocks is termed a Full design, in which there is at least one roll-out assessment. Then T = S, so that the number of periods becomes P = b + S, and all clusters receive YesI in the final Period. The general form of the corresponding design matrix is illustrated in Figure 17.5.


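\[
D\,(A \times P) =
\begin{pmatrix}
0 & 1 & 1 & \cdots & 1 & 1 \\
0 & 0 & 1 & \cdots & 1 & 1 \\
0 & 0 & 0 & \cdots & 1 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}
\]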

Figure 17.5 The A (=K) by P design matrix, D, for a Full SWD with, b = 1, g = 2 with k = 1 new cluster allocated to YesI in each successive period after the roll-out

Example 17.2 Design matrix – ischaemic stroke

The basic characteristics of the trial conducted by Haesebaert, Nighoghossian, Mercier, et al. (2018) are summarised in Figure 17.4a. The corresponding design matrix, D, involves K = 18 clusters (hence K > A = 4), all assessed at the run-in, and then divided into groups of 4 or 5, with each cluster group then assigned to one arm comprising one of the four binary sequences of Figure 17.6. For example, over P = 5 periods, cluster group G3 would be allocated sequence 00011 and thus have the pre-roll-out (usual care – NoI) in Period 1 followed by NoI, NoI, YesI and YesI in the following four periods of the design. The complete design, including the run-in, involves C = 90 assessment cells, of which 44 receive NoI and 46 are allocated YesI. Once a cluster receives YesI, that intervention remains in place for the remaining periods of the design. This need not be the case in every SWD as it will depend on the questions raised when proposing such a design. To emphasise that the Baseline run-in is not part of the randomisation sequence, it is indicated by (0:). This also indicates the possibility that procedures operational at Baseline could differ from the NoI which is adopted by the chosen design.

D (18 × 5) =

G1 k1 = 5    0:  1  1  1  1
             0:  1  1  1  1
             0:  1  1  1  1
             0:  1  1  1  1
             0:  1  1  1  1
G2 k2 = 4    0:  0  1  1  1
             0:  0  1  1  1
             0:  0  1  1  1
             0:  0  1  1  1
G3 k3 = 5    0:  0  0  1  1
             0:  0  0  1  1
             0:  0  0  1  1
             0:  0  0  1  1
             0:  0  0  1  1
G4 k4 = 4    0:  0  0  0  1
             0:  0  0  0  1
             0:  0  0  0  1
             0:  0  0  0  1

Figure 17.6 Design matrix, D, corresponding to the full SWD of Figure 17.3a. Source: Based on Haesebaert, Nighoghossian, Mercier, et al. (2018).
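The design matrix of Figure 17.6 can be generated programmatically, which also confirms the cell counts quoted in Example 17.2. The Python sketch below is one way of doing so; the function name and its arguments are illustrative assumptions.

```python
def full_swd_matrix(group_sizes, pre_rollout=1):
    """Design matrix of a Full SWD: one row per cluster, 0 = NoI, 1 = YesI.

    Group g (counted from 0) switches to YesI at step g + 1 and then
    remains on YesI until the final period.
    """
    steps = len(group_sizes)
    rows = []
    for g, size in enumerate(group_sizes):
        sequence = [0] * (pre_rollout + g) + [1] * (steps - g)
        rows.extend(list(sequence) for _ in range(size))
    return rows


D = full_swd_matrix([5, 4, 5, 4])            # G1 to G4 of Figure 17.6
cells = sum(len(row) for row in D)           # C = K x P
yes_cells = sum(sum(row) for row in D)       # cells allocated YesI

for row in D:
    print(" ".join(str(value) for value in row))
print(f"Cells: {cells}, NoI: {cells - yes_cells}, YesI: {yes_cells}")
```

The first column of every row is the run-in, shown as (0:) in Figure 17.6; it is coded here as an ordinary 0.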


17.3.2 Incomplete and other designs

An Incomplete or Reduced design has no ‘pre-roll-out’, in which case b = 0, and the number of periods is P = maximum (S, T). In fact, there are so many possible alternative designs that it is not possible to catalogue them here. For example, a SWD might commence in the way described above, but once a cluster has been allocated YesI, this may be confined to a single period rather than continuing with YesI until all periods in the design are complete.

17.4 Randomisation

As can be seen from the design matrices of Figures 17.5 and 17.6, each row represents a binary sequence, beginning with zero, 0, if a run-in assessment is involved. Thereafter, post-randomisation and depending on the specific design chosen, the sequences range from those with the first cell containing ‘0’ and the remaining cells containing ‘1’ to those with all cells containing ‘0’ apart from the last in the sequence, which contains ‘1’. The above essentially defines a series of A (binary digit) arms, each beginning with 0 (NoI) and continuing with 0 until 1 (YesI) is allocated and thereafter continuing as 1 until the trial ends. As a consequence, in a particular trial setting all A arms can be identified at the design stage by their binary sequence, and the usual practice is to allocate these at random to all the clusters before the trial commences rather than to randomise clusters sequentially over time after each period is completed.

Example 17.3 Randomisation

The trial of Haesebaert, Nighoghossian, Mercier, et al. (2018) illustrated in Figure 17.3a uses four different randomisation sequences: 0:1111, 0:0111, 0:0011 and 0:0001, which are allocated randomly to 18 clusters. However, the randomisation was blocked to ensure as much equality of allocation to each of the sequences as is possible. Thus, the first sequence was allocated to 5 clusters, the second to 4, the third to 5 and the fourth to 4.
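An allocation of this kind can be sketched in Python as follows. This is not the procedure used by the trial investigators, whose method is not described here; it is merely one way, under stated assumptions, of allocating all clusters before the trial starts while keeping the numbers per sequence as equal as possible. The cluster labels, the seed and the function name are hypothetical.

```python
import random

SEQUENCES = ["0:1111", "0:0111", "0:0011", "0:0001"]


def allocate(clusters, sequences=SEQUENCES, seed=None):
    """Randomly allocate every cluster to a sequence, as equally as possible."""
    rng = random.Random(seed)
    base, extra = divmod(len(clusters), len(sequences))
    counts = {s: base for s in sequences}
    for s in rng.sample(sequences, extra):   # sequences given one extra cluster
        counts[s] += 1
    labels = [s for s in sequences for _ in range(counts[s])]
    rng.shuffle(labels)                      # random order across clusters
    return dict(zip(clusters, labels))


eus = [f"EU{i}" for i in range(1, 19)]       # hypothetical cluster labels
allocation = allocate(eus, seed=2018)

for eu, sequence in allocation.items():
    print(eu, sequence)
```

With 18 clusters and 4 sequences, two sequences necessarily receive 5 clusters and two receive 4, as in the trial; which two receive the extra cluster is itself chosen at random in this sketch.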

17.5 Cross-sectional design

In a Cross-sectional SWD (CS-SWD) design, a different group of m participants is recruited within each cell of every cluster, so that by the end of the trial a total of N = mC participants are involved, where C is the total number of cells in the chosen design, and from whom the individual outcome variable is assessed. In practice, the number of participants is likely to vary from cell to cell, so that the total number of participants will be

\[ N = \sum_{i=1}^{C} m_i . \]


17.5.1 Analysis

Although in other chapters we have described, for example, sample size calculations before aspects of analysis, we have taken the opposite approach here to facilitate the explanation of some rather complex issues. The basic structure for the analysis appears to follow that for a combination of a cluster randomised trial of the types illustrated in Chapter 16 together with a repeated measures design of Chapter 14. However, although the CS-SWD is longitudinal in nature, the fact that entirely different subjects are entered in each cell of a particular cluster implies that there is no auto-correlation involved. Nevertheless, because data are obtained from each cluster involved at each period of time, due note of this association may have to be taken.

17.5.1.1 Before-and-after

In essence, the CS-SWD can be thought of as a Before-and-After design. Thus, the basic item for analysis is the difference within each cluster between the pre-intervention (NoI) summary values and the post-intervention (YesI) summaries of the endpoint of concern.

Example 17.4 Simplified analysis – ischaemic stroke

The trial conducted by Haesebaert, Nighoghossian, Mercier, et al. (2018) is of the CS-SWD type. Each EU is a unique cluster which contributes information from essentially different participants in each of the five periods of time. Each cell therefore potentially provides a single (and independent) estimate of the proportion of patients receiving thrombolysis within 4.5 h based on a variable number of patients admitted to the EU during each period. However, the authors only give details of the summary proportion for each EU for all pre-intervention control (NoI) cells of a cluster combined and all corresponding post-intervention (YesI) cells combined. Their results are summarised in Figure 17.3b and partially in Figure 17.7a, the latter also giving the differences, Diff = YesI − NoI. Although each cluster within a group is allocated the same arm, there is quite a lot of variation in outcome between the clusters concerned. For example, in the centres within arm G1, the median Diff = 0.13 and ranges from 0.07 to 0.62. Although the pre- and post-intervention summaries are proportions, we are assuming the corresponding difference, Diff, has an approximately normal distribution. Whether this assumption is reasonable can be checked by using the command (pnorm Diff) to give the standardised normal probability plot of Figure 17.7b. Since the plotted points lie (more-or-less) along the sloping 45° line, use of the paired t-test for the analysis, as summarised in Figure 17.7c, would seem reasonable. This analysis indicates an advantage of YesI over NoI of 12.0% more receiving thrombolysis with associated 95% CI 3.3–20.8%, p-value = 0.0099. However, to indicate how the phased timing for initiating YesI affects the magnitude of the advantage over NoI, the appropriate command is (regress Diff i.Arm). The edited output of Figure 17.7d suggests that G1, with the most experience of YesI (over four periods), leads to the greatest advantage since Diff1 = 0.2085 + 0 or 20.9% for YesI over NoI. For Arm G2, Diff2 = 0.2085 − 0.1154 = 0.0931 and similarly Diff3 = 0.0567 and Diff4 = 0.1013. These three arms also indicate a benefit (since Diff > 0) but to a lesser extent than G1. The p-values for the arms are obtained from one-sample paired t-tests of the null hypothesis of no difference for the individual arms concerned.

(a)

list EU Arm NoI YesI Diff

+----------------------------------+
| EU   Arm    NoI    YesI    Diff  |
|----------------------------------|
|  1   G1    0.10    0.22    0.12  |
|  2   G1    0.15    0.28    0.13  |
|  3   G1    0.13    0.21    0.08  |
|  4   G1    0.00    0.62    0.62  |
|  5   G1    0.47    0.57    0.10  |
|----------------------------------|
|  6   G2    0.28    0.36    0.08  |
|  7   G2    0.00    0.38    0.38  |
|  8   G2    0.13    0.00   -0.13  |
|  9   G2    0.00    0.04    0.04  |
|----------------------------------|
| 10   G3    0.49    0.53    0.04  |
| 11   G3    0.00    0.00    0.00  |
| 12   G3    0.08    0.20    0.12  |
| 13   G3    0.13    0.20    0.07  |
| 14   G3    0.00     .       .    |
|----------------------------------|
| 15   G4    0.30    0.43    0.13  |
| 16   G4    0.32    0.60    0.28  |
| 17   G4    0.00    0.00    0.00  |
| 18   G4    0.00    0.00    0.00  |
+----------------------------------+

(b)

pnorm Diff

[Standardised normal probability plot of Diff: vertical axis Normal F[(Diff-m)/s], horizontal axis Empirical P[i] = i/(N + 1), both running from 0.00 to 1.00.]

(c) Comparison of YesI – NoI using a paired t-test
ttest Diff==0
-----------------------------------------------------
Variable |  Obs     Mean      SE        [95% CI]
---------+-------------------------------------------
    Diff |   17    0.1204   0.0411   0.0332 to 0.2076
-----------------------------------------------------
t = 2.927, degrees of freedom = 16, p-value = 0.0099

(d) Comparison of YesI – NoI using a paired t-test within each randomisation arm
regress Diff i.Arm
-------------------------------------------------------
      Arm |    Coef       SE      Diff       t    P>|t|
----------+--------------------------------------------
 Constant |   0.2085   0.0785
       G1 |   0                  0.2085    2.66   0.020
       G2 |  -0.1154   0.1177    0.0931   -0.98   0.35
       G3 |  -0.1518   0.1177    0.0567   -1.29   0.22
       G4 |  -0.1072   0.1177    0.1013   -0.91   0.38
-------------------------------------------------------

Figure 17.7 Simplified analysis of a CS-SWD design comparing the proportion of confirmed ischaemic patients receiving thrombolysis within 4.5 hours between pre- and post-intervention from 17 of 18 Emergency Units. Source: Data from Haesebaert, Nighoghossian, Mercier, et al. (2018).
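The before-and-after analysis of Figure 17.7c can be approximated outside Stata. The Python sketch below recomputes each EU's difference from the counts of Figure 17.3b rather than from the rounded proportions of Figure 17.7a, so its results agree with the published output only to about two decimal places; the data layout and variable names are illustrative choices.

```python
from math import sqrt
from statistics import mean, stdev

# (thrombolysed, patients) under NoI and YesI for each EU, from Figure 17.3b;
# None marks the EU with no post-intervention patients
counts = {
    1: ((1, 10), (5, 23)),    2: ((3, 20), (16, 58)),
    3: ((2, 15), (6, 29)),    4: ((0, 4), (21, 34)),
    5: ((7, 15), (12, 21)),   6: ((20, 71), (25, 69)),
    7: ((0, 12), (3, 8)),     8: ((1, 8), (0, 6)),
    9: ((0, 7), (1, 24)),     10: ((33, 67), (18, 34)),
    11: ((0, 2), (0, 1)),     12: ((1, 13), (3, 15)),
    13: ((2, 15), (2, 10)),   14: ((0, 10), None),
    15: ((3, 10), (3, 7)),    16: ((11, 34), (9, 15)),
    17: ((0, 6), (0, 4)),     18: ((0, 9), (0, 5)),
}

# Diff = YesI proportion minus NoI proportion, one value per EU
diffs = [yes[0] / yes[1] - no[0] / no[1]
         for no, yes in counts.values() if yes is not None]

n = len(diffs)
d_bar = mean(diffs)
se = stdev(diffs) / sqrt(n)          # standard error of the mean difference
t_stat = d_bar / se                  # one-sample t statistic on n - 1 df

print(f"n = {n}, mean Diff = {d_bar:.4f}, SE = {se:.4f}, t = {t_stat:.3f}")
```

Each EU contributes one difference regardless of how many patients it admitted, so small units carry the same weight as large ones in this simplified analysis.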


17.5.1.2 Modelling approach

If the data from the subjects in each cluster for all the individual periods concerned are available for analysis, then an appropriate regression model for a CS-SWD with a continuous outcome extends Equation (16.2) for a single-period cluster trial by adding one or more repeated (longitudinal in time) elements in which the cluster is again investigated. The interrelationships between Time itself, Period and Arm are indicated in Figure 17.8 for a SWD including a roll-out assessment (hence b = 1). In general, if A intervention (treatment) arms are included in the design, these may be allocated to K > A clusters. However, to simplify the discussion, a regression model for the design with A arms, each arm allocated to one of the K = A clusters all of size m, is described. Consistent with the design of the CS-SWD, the regression model is extended to include time. This can be represented either by Time denoted as 0, 1, …, T or by Period denoted by 1, 2, …, T + 1. Thus, expressed in terms of Time,

\[ y_{itk} = \beta_0 + \beta_{\text{Treat}}\, x_{tk} + \beta_{\text{Time}}\, v_{tk} + \gamma^{*}_{k} + \eta^{*}_{tk} + \varepsilon_{itk}, \qquad (17.1) \]

where i = 1, 2, …, m indexes the subjects in each cluster, k = 1, 2, …, K indexes the clusters, steps are taken at times t = 0, 1, 2, …, T including the pre-roll-out period, and xtk = 1 if YesI is allocated at t, whilst xtk = 0 if NoI. The coefficients βTreat and βTime are fixed effects. The two random effects terms, γ∗k and η∗tk, account for cluster and times within a cluster, respectively, and εitk is the random error term. These are assumed to have independent normal distributions with mean 0 and corresponding variances σ²Cluster, σ²Time Cluster and σ²Random, respectively.

The regression model of Equation (17.1) includes the term βTime vtk and regards Time as an ordered numerical variable taking values 0, 1, 2 and 3 as in Figure 17.8. However, this presumes a linear effect of Time over the successive Periods of the SWD. Alternatively, Time may be regarded as a four-group unordered categorical variable. In such a case the term βTime vtk would be replaced in the model by the more complex form γTime1 v1tk + γTime2 v2tk + γTime3 v3tk, using dummy variables created in a similar way to that of Equation (8.21). However, the influence of time itself within each cluster as it progresses through the successive periods of the design may well differ between the pre-intervention and post-intervention stages. Thus, for example, the time profiles in Cluster 3 of Figure 17.8 may differ in the pre-intervention Periods 1 and 2 from the post-intervention Periods 3 and 4.

Finally, although Arm is also linked to time, in that post-intervention periods follow pre-intervention periods, it does not fully capture the effect of time. Thus, in the results from Haesebaert, Nighoghossian, Mercier, et al. (2018) summarised in Figure 17.3b, clusters in Arm G3 merge Period 1, 2 and 3 data and compare these with Periods 4 and 5 combined. The influence of Arm (replacing that of Time) within the model may then be investigated using a dummy variable approach. However, there may be loss of information since the comparison is made based on whether the outcome was measured pre- or post-intervention.

                   Period     1       2       3       4
Cluster    Arm     Time       0       1       2       3
   1        1                NoI     YesI    YesI    YesI
   2        1                NoI     YesI    YesI    YesI
   3        2                NoI     NoI     YesI    YesI
   4        3                NoI     NoI     NoI     YesI
   5        3                NoI     NoI     NoI     YesI

Variable type     Period    Time    Arm
Dummy               X        X       X
Numerical                    X

Figure 17.8 The relationships between Time, Period and Arm, in a SWD

The naïve analysis summarised in Figure 17.9c essentially ignores the impact of the cluster design and the potential influences of time. Ignoring the cluster is likely to impact on the estimate of the OR and (more likely) the associated 95% CI. To take this into account, the (logistic) section of the command is replaced by (melogit), which allows for the auto-correlation of subjects within a cluster. We note that Examples 17.5 and 17.6 do not describe a full analysis of the trial of Haesebaert, Nighoghossian, Mercier, et al. (2018) but are merely used to illustrate the methodology. A full analysis would take into account the covariates recorded by the authors, such as patient age, admission during the day or night, and EMS call or no call.
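The two ways of entering time into Equation (17.1) can be made concrete by writing out the covariates for the small design of Figure 17.8. The Python sketch below builds, for every cluster and time, the intervention indicator xtk, the linear time variable vtk and the three dummy variables that replace it when Time is treated as categorical; the tuple layout and function name are illustrative choices.

```python
# Time at which each cluster of Figure 17.8 first receives YesI
first_yes = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
TIMES = (0, 1, 2, 3)


def time_dummies(t, times=TIMES):
    """Indicators (v1, v2, v3) for unordered Time, with Time 0 as reference."""
    return tuple(int(t == level) for level in times[1:])


rows = []
for cluster, start in first_yes.items():
    for t in TIMES:
        x = int(t >= start)          # x_tk: 1 if YesI, 0 if NoI
        v = t                        # v_tk when Time is linear
        rows.append((cluster, t, x, v, time_dummies(t)))

print("cluster time x v dummies")
for row in rows:
    print(*row)
```

The linear coding uses a single coefficient βTime, whereas the dummy coding requires three coefficients for the four time points, which is the price paid for not assuming a linear trend.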

Example 17.5 Simple logistic regression – ischaemic stroke

The endpoint measure for the trial conducted by Haesebaert, Nighoghossian, Mercier, et al. (2018) is the difference, Diff, in the proportion of patients receiving thrombolysis between those in the NoI and YesI groups. In contrast to the approach of Example 17.4, the more appropriate analysis of the difference involves use of logistic regression. However, implementing the logistic regression analysis requires reformatting the data of Figure 17.3b into a single row for each individual concerned, with the data file comprising columns including EU, Arm, Inter, Time and a binary 0 or 1 variable Throm representing No and Yes, a small selection of which is given in Figure 17.9a. The crosstabulation of Figure 17.9b shows the data from the 691 individuals concerned and suggests an 8.5% improvement in the thrombolysis rate after the EU professionals have received targeted training. The command (logistic Throm Inter, or) of Figure 17.9c essentially replicates the calculation using the crosstabulation of Figure 17.9b to provide the OR and an associated 95% CI. This indicates an OR = 1.51 (p-value = 0.015) in favour of YesI.

(a)

+----------------------------------+
| EU   Arm   Inter   Time   Throm  |
|----------------------------------|
|  8   G2    NoI      0      No    |
|  8   G2    NoI      0      No    |
|  8   G2    NoI      0      No    |
|  8   G2    NoI      1      No    |
|  8   G2    NoI      1      No    |
|  8   G2    NoI      1      No    |
|  8   G2    NoI      1      No    |
|----------------------------------|
|  8   G2    NoI      0      Yes   |
|  8   G2    YesI     2      No    |
|  8   G2    YesI     3      No    |
|  8   G2    YesI     3      No    |
|  8   G2    YesI     4      No    |
|  8   G2    YesI     4      No    |
|  8   G2    YesI     4      No    |
+----------------------------------+

(b)

tabulate Inter Throm
---------+----------------+-------
         |  Thrombolysis  |
   Treat |    No     Yes  | Total
---------+----------------+-------
     NoI |   244      84  |   328
    YesI |   239     124  |   363
---------+----------------+-------
   Total |   483     208  |   691
---------+----------------+-------
pNoI = 84/328 = 0.2561
pYesI = 124/363 = 0.3416
pYesI – pNoI = 0.3416 – 0.2561 = 0.0855 or 8.5%
OR = [0.3416/(1 − 0.3416)]/[0.2561/(1 − 0.2561)] = 1.5071

(c)

logistic Throm Inter, or
Logistic regression                        Number of obs = 691
--------------------------------------------------------------
 Throm |     OR       SE       z    p-value      [95% CI]
-------+------------------------------------------------------
 Inter |  1.5071   0.2533    2.44    0.015   1.0841 to 2.0951
--------------------------------------------------------------

Figure 17.9 A naïve analysis comparing the proportion of confirmed ischaemic patients receiving thrombolysis. Source: Data from Haesebaert, Nighoghossian, Mercier, et al. (2018).
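The reformatting step and the naïve analysis of Figure 17.9 can be sketched together in Python. The code expands the per-EU counts of Figure 17.3b into one record per patient and then computes the OR with a Wald-type 95% CI on the log-odds scale; this interval happens to match the logistic regression output for a single binary covariate. Time is omitted because the period of each patient cannot be recovered from the summary counts, and the record layout is an illustrative choice.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# (thrombolysed, patients) under NoI and YesI for each EU, from Figure 17.3b
counts = {
    1: ((1, 10), (5, 23)),    2: ((3, 20), (16, 58)),
    3: ((2, 15), (6, 29)),    4: ((0, 4), (21, 34)),
    5: ((7, 15), (12, 21)),   6: ((20, 71), (25, 69)),
    7: ((0, 12), (3, 8)),     8: ((1, 8), (0, 6)),
    9: ((0, 7), (1, 24)),     10: ((33, 67), (18, 34)),
    11: ((0, 2), (0, 1)),     12: ((1, 13), (3, 15)),
    13: ((2, 15), (2, 10)),   14: ((0, 10), (0, 0)),
    15: ((3, 10), (3, 7)),    16: ((11, 34), (9, 15)),
    17: ((0, 6), (0, 4)),     18: ((0, 9), (0, 5)),
}

# One record per patient: (EU, Inter, Throm)
records = []
for eu, cells in counts.items():
    for inter, (events, total) in zip(("NoI", "YesI"), cells):
        records += [(eu, inter, 1)] * events
        records += [(eu, inter, 0)] * (total - events)


def tally(inter, throm):
    """Number of patients with the given intervention and outcome."""
    return sum(1 for _, i, y in records if i == inter and y == throm)


a, b = tally("YesI", 1), tally("YesI", 0)
c, d = tally("NoI", 1), tally("NoI", 0)

odds_ratio = (a * d) / (b * c)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = NormalDist().inv_cdf(0.975)
ci = (exp(log(odds_ratio) - z * se_log_or),
      exp(log(odds_ratio) + z * se_log_or))

print(f"N = {len(records)}, OR = {odds_ratio:.4f}, "
      f"95% CI {ci[0]:.4f} to {ci[1]:.4f}")
```

Like the command of Figure 17.9c, this calculation ignores both the clustering by EU and any effect of time.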

Example 17.6

Accounting for cluster and time – ischaemic stroke

To adjust for the random effect of the EUs, the command required for the trial of Haesebaert, Nighoghossian, Mercier, et al. (2018) is (melogit Throm Inter || EU:, or). From Figure 17.10a, the resulting OR = 1.85 (p-value = 0.0013) is favourable to YesI with 95% CI of 1.27 to 2.70, whereas the earlier result of Figure 17.9c ignoring the clustering gave OR = 1.51. In order to account for the influence of Time on the relative effect of the intervention, YesI in a CS-SWD, the command is extended to (melogit Throm Inter i.Time|| EU: || Time:, or), where the (i.Time) term indicates that the required dummy variables are generated. As shown in Figure 17.10b, compared to Time 0 (the pre-roll-out Period 1), with an OR set equal to unity, none of the other periods differ from that significantly as the respective p-values are 0.68, 0.70, 0.39 and 0.65. However, as compared to the model not including Time,


the intervention effect is reduced to OR = 1.499 with a non-statistically significant p-value = 0.21. When (linear) Time, as an ordered numerical variable, is in the model command of Figure 17.10c, then Time is not statistically significant (p-value = 0.45). Nevertheless, it marginally increases the estimated intervention effect from OR = 1.499 to OR = 1.530, with p-value = 0.18.

(a) Fixed effect of the intervention and random effect of cluster

melogit Throm Inter || EU:, or

Mixed-effects logistic regression
Number of obs = 691, Group variable: EU, Number of groups = 18

  --------+-----------------------------------------------------
    Throm |     OR       SE       z    p-value     [95% CI]
  --------+-----------------------------------------------------
    Inter | 1.8515   0.3553    3.21    0.0013  1.2712 to 2.6969
  --------+-----------------------------------------------------

(b) Fixed effects of intervention and categorical time, and random effects of cluster and time within cluster

melogit Throm Inter i.Time || EU: || Time:, or

  --------+-----------------------------------------------------
    Throm |     OR       SE       z    p-value     [95% CI]
  --------+-----------------------------------------------------
    Inter | 1.4994   0.4892    1.24    0.21    0.7911 to 2.8421
     Time |
        0 | 1
        1 | 0.8829   0.2653   -0.41    0.68
        2 | 1.1427   0.3942    0.39    0.70
        3 | 1.4125   0.5719    0.85    0.39
        4 | 1.2213   0.5316    0.46    0.65
  --------+-----------------------------------------------------

(c) Fixed effects of intervention and linear time, and random effects of cluster and time within cluster

melogit Throm Inter Time || EU: || Time:, or

  --------+-----------------------------------------------------
    Throm |     OR       SE       z    p-value     [95% CI]
  --------+-----------------------------------------------------
    Inter | 1.5295   0.4877    1.33    0.18    0.8187 to 2.8575
     Time | 1.0814   0.1128    0.75    0.45
  --------+-----------------------------------------------------

Figure 17.10 Analysis of a CS-SWD comparing the proportion of confirmed ischaemic patients receiving thrombolysis. (a) Fixed effect of the intervention and random effect of cluster. (b) Fixed effects of intervention and categorical time, and random effects of cluster and time within cluster. (c) Fixed effects of intervention and linear time, and random effects of cluster and time within cluster. Source: Data from Haesebaert, Nighoghossian, Mercier, et al. (2018).
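The z, p-value and confidence limits tabulated for Inter in Figure 17.10 are linked to the reported OR and SE through the usual Wald calculation on the log scale. The sketch below assumes that the tabulated SE is on the OR scale, so that the SE of log OR is SE/OR (delta method); the function name wald_from_or is hypothetical and is not a Stata or library routine.

```python
import math
from statistics import NormalDist


def wald_from_or(odds_ratio, se_or, level=0.95):
    """Return (z, two-sided p, lower, upper) from an OR and its SE."""
    se_log = se_or / odds_ratio          # delta method: SE of log(OR)
    log_or = math.log(odds_ratio)
    z = log_or / se_log
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(log_or - z_crit * se_log)
    upper = math.exp(log_or + z_crit * se_log)
    return z, p, lower, upper


# OR and SE for Inter in panels (a), (b) and (c) of Figure 17.10
panels = {"a": (1.8515, 0.3553), "b": (1.4994, 0.4892), "c": (1.5295, 0.4877)}
results = {name: wald_from_or(*values) for name, values in panels.items()}

for name, (z, p, lower, upper) in results.items():
    print(f"({name}) z = {z:.2f}, p = {p:.4f}, 95% CI {lower:.4f} to {upper:.4f}")
```

The widening of the interval between panels (a) and (b) reflects the larger SE once time is included in the model.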


17.5.2 Design effect

As we have indicated in describing model (17.1), there are now two cluster component variances, $\sigma^2_{\text{Cluster}}$ and $\sigma^2_{\text{Time|Cluster}}$, that we have to consider as well as $\sigma^2_{\text{Random}}$. Consequently, the quantities of Figure 17.11 are obtained and these can be seen to involve an extension of the definition of the ICC of Equation (16.4). The fact that clusters are involved in the CS-SWD implies that the ICC within each cluster needs to be taken into account. Further, since each cluster is a segment of a longitudinal design, the possibility of auto-correlation also needs to be considered. In general, the Design Effect (DE) for a CS-SWD will incorporate both of these correlations for trial size determination. As in the case of the cluster designs of Chapter 16, the influence of the intra-class correlation is expressed through the DE of Equation (16.5), which we repeat here for convenience:

$$DE_{\text{Cluster}} = 1 + (m - 1)\rho \tag{17.2}$$

Here m is the number of subjects per cell and the components of ρ are given in Figure 17.11. Hooper, Teerenstra, de Hoop and Eldridge (2016) define the correlation, $r_{CS}$, between two sample means on m participants from the same cluster in different periods of a CS-SWD as

$$r_{CS} = \frac{m\rho\omega}{1 + (m - 1)\rho}, \tag{17.3}$$

where ρ is the ICC and ω the cluster auto-correlation as defined in Figure 17.11.

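Total variance:
$$\sigma^2_{\text{Total}} = \sigma^2_{\text{Random}} + \sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}$$

Intra Class Correlation (ICC):
$$\rho = \left(\sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}\right)/\sigma^2_{\text{Total}}$$
The correlation between assessments of two individuals from the same cluster at the same time.

Cluster auto-correlation:
$$\omega = \sigma^2_{\text{Cluster}}/\left(\sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}\right)$$
The correlation between two population means from the same cluster at different times.

If $\sigma^2_{\text{Time|Cluster}} = 0$, then $\omega = 1$ and $\rho = \sigma^2_{\text{Cluster}}/\sigma^2_{\text{Total}}$.

Note: For brevity, $\sigma^2_{\text{Between–Cluster}}$ of Equation (16.8) is abbreviated to $\sigma^2_{\text{Cluster}}$.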

Figure 17.11 Definition of intra-class correlation coefficient (ICC), ρ, together with the cluster auto-correlation, ω, for a CS-SWD design
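The definitions of Figure 17.11, together with Equations (17.2) and (17.3), translate into a few lines of arithmetic. The sketch below is illustrative only: the three variance components and the cell size m are hypothetical values chosen to give round answers, not values from any trial, and the function names are not from any library.

```python
def cs_correlations(var_random, var_time_cluster, var_cluster):
    """ICC (rho) and cluster auto-correlation (omega) as in Figure 17.11."""
    between = var_time_cluster + var_cluster
    total = var_random + between
    rho = between / total
    # omega is undefined when there is no between-cluster variation at all
    omega = var_cluster / between if between > 0 else float("nan")
    return rho, omega


def de_cluster(m, rho):
    """Equation (17.2)."""
    return 1 + (m - 1) * rho


def r_cs(m, rho, omega):
    """Equation (17.3)."""
    return m * rho * omega / (1 + (m - 1) * rho)


# Hypothetical variance components (assumed for illustration)
rho, omega = cs_correlations(var_random=0.90, var_time_cluster=0.02, var_cluster=0.08)
m = 20  # hypothetical number of subjects per cell
print(f"rho = {rho:.3f}, omega = {omega:.3f}")
print(f"DE_Cluster = {de_cluster(m, rho):.3f}, r_CS = {r_cs(m, rho, omega):.4f}")
```

Setting var_time_cluster to zero gives ω = 1, which is the simplification used in Example 17.7.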


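In general terms, the DE when considering the longitudinal aspects of the design is given by

$$DE_{\text{Repeat}}(r) = \frac{A^2 (1 - r)(1 + Tr)}{4\left[AE - G + \left(E^2 + TAE - TG - AF\right) r\right]}, \tag{17.4}$$

where r depends on the associated auto-correlation structure, and

$$E = \sum_{l}\sum_{t} x_{lt}, \quad F = \sum_{l}\Big(\sum_{t} x_{lt}\Big)^2, \quad G = \sum_{t}\Big(\sum_{l} x_{lt}\Big)^2, \tag{17.5}$$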

which are calculated from the binary elements, $x_{lt}$, of the matrix D (row l, column t) for the specific design in question. For the Full designs of the type of Figure 17.1, and assuming k ≥ 1 is the same number of clusters assigned to each arm, these take the values: kT(T + 1)/2, kT(T + 1)(2T + 1)/6 and kT(T + 1)(2T + 1)/6 respectively. Some examples of the CS design matrix, D, with the corresponding values of DERepeat(r) are given in Figure 17.12.

(i) For b = 1, A = 2 arms and k = 1, T = S = 1 step, hence P = 2, and

$$D\,(2 \times 2) = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.$$

Further, E = F = G = 1 and using equation (17.4) the corresponding combined $DE_{\text{Repeat}}(r_{CS}) = (1 - r_{CS}^2)$.

(ii) For b = 0, A = 2 and k = 1, YesI allocated to one cluster immediately so T = S = 2, hence P = 2 and

$$D\,(2 \times 2) = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$

For b = 1, A = 2 and k = 1, T = S = 2, P = 3 and

$$D\,(2 \times 3) = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$

In both cases, E = 3, F = G = 5 and

$$DE_{\text{Repeat}}(r_{CS}) = \frac{(1 - r_{CS})(1 + 2r_{CS})}{(1 + r_{CS})}.$$

(iii) For b = 0, A = 3 and k = 1, YesI allocated in a total of T = S = 3 introducing YesI at each stage, then P = 3 and

$$D\,(3 \times 3) = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$

For b = 1, A = 3 and k = 1, T = S = 3, P = 4 and

$$D\,(3 \times 4) = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$

In both cases, E = 6, F = G = 14, and

$$DE_{\text{Repeat}}(r_{CS}) = \frac{9(1 - r_{CS})(1 + 3r_{CS})}{8(2 + 3r_{CS})}.$$

Figure 17.12 Examples of the CS-SWD design matrix, D, with the corresponding values of DERepeat(rCS)
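The sums E, F and G of Equation (17.5) and the design effect of Equation (17.4) are easily computed from any design matrix. The following is a sketch under the assumption that the rows of D are the A arms (with k = 1) and the columns are the periods, and that T is supplied separately as in the examples of Figure 17.12; the function names are illustrative.

```python
def design_sums(D):
    """E, F and G of Equation (17.5) for a 0/1 design matrix D."""
    E = sum(sum(row) for row in D)
    F = sum(sum(row) ** 2 for row in D)          # squared row (arm) totals
    G = sum(sum(col) ** 2 for col in zip(*D))    # squared column (period) totals
    return E, F, G


def de_repeat(D, T, r):
    """Equation (17.4); A is taken as the number of rows of D."""
    A = len(D)
    E, F, G = design_sums(D)
    numerator = A**2 * (1 - r) * (1 + T * r)
    denominator = 4 * (A * E - G + (E**2 + T * A * E - T * G - A * F) * r)
    return numerator / denominator


r = 0.3  # hypothetical correlation, for illustration only

d_i = [[0, 0], [0, 1]]                       # example (i), b = 1, T = 1
d_ii = [[1, 1], [0, 1]]                      # example (ii), b = 0, T = 2
d_ii_runin = [[0, 1, 1], [0, 0, 1]]          # example (ii), b = 1, T = 2
d_iii = [[1, 1, 1], [0, 1, 1], [0, 0, 1]]    # example (iii), b = 0, T = 3

print(design_sums(d_ii), design_sums(d_iii))
print(de_repeat(d_i, 1, r), de_repeat(d_ii, 2, r), de_repeat(d_iii, 3, r))
```

With r = 0.3 the three values agree with the closed forms of Figure 17.12, namely 1 − r², (1 − r)(1 + 2r)/(1 + r) and 9(1 − r)(1 + 3r)/[8(2 + 3r)].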


Finally, for trial size estimation purposes, combining DECluster with DERepeat(rCS) gives the DE for a CS design as

$$DE_{CS} = DE_{\text{Cluster}} \times DE_{\text{Repeat}}(r_{CS}) \tag{17.6}$$

17.5.3 Trial size

If the outcome variable is continuous, the number of subjects required for a two-group individually randomised trial, N0, is given by Equation (9.5). However, this is modified by multiplying by DECS and, as for a full design with b = 1 there will be P = 1 + T periods in the SWD, this will be further increased by multiplying by P. Thus, the planned total sample size is

$$N_{CS} = DE_{CS} \times (1 + T) \times N_0 \tag{17.7}$$

The corresponding number of clusters is estimated by

$$K_{CS} = DE_{CS} \times \frac{N_0}{m} \tag{17.8}$$

When there is likely to be variation in the number of subjects recruited per cluster, m in Equation (17.8) is replaced by the mean cluster size, $\bar{m}$.

Example 17.7

Cross-sectional – acute chest pain

The structure of the CS-SWD trial described by Poldervaart, Reitsma, Hoffijberg, et al. (2013), which is summarised in Figure 17.4b, aimed to compare the use of the HEART score (H) against Usual care (U) for use in the early assessment of patients with acute chest pain presenting at emergency departments. The completed trial described by Poldervaart, Reitsma, Backus, et al. (2017) concluded ‘Using the HEART score … the effect on health care resources is limited, ….’ Nevertheless, we suppose the CS-SWD is to be replicated in a different setting but with the anticipation that the use of H will detect 3% more cases than that anticipated by U of 15%. Further, it is anticipated that 10 hospitals will participate and each contributes 60 patients per month. A 2-sided test size of 5% and a power of 80% is set. Using Equation (9.6) for a binary outcome with φ = 1 gives an initial estimate of sample size, N0 = 4803.77 subjects. Assuming the between-hospital variation in outcome of 14 to 16% represents approximately 4σCluster, then σCluster = 0.02/4 = 0.005. Further assuming $\sigma^2_{\text{Time|Cluster}} = 0$ and taking the mean proportion of cases as $\bar{\pi} = (\pi_U + \pi_H)/2 = 0.165$ in Equation (16.8), suggests the ICC of Figure 17.11 as

$$\rho = \frac{\sigma^2_{\text{Cluster}}}{\bar{\pi}(1 - \bar{\pi})} = \frac{0.005^2}{0.165 \times 0.835} = 0.0001815.$$


If we assume the cluster auto-correlation ω = 1 then, with m = 60, Equation (17.3) gives

$$r_{CS} = \frac{60 \times 0.0001815 \times 1}{1 + (60 - 1) \times 0.0001815} = 0.010772.$$

As the CS-SWD has a full design, Equation (17.5) with A = T = 10 gives

$$E = \frac{T(T + 1)}{2} = 55 \quad \text{and} \quad F = G = \frac{T(T + 1)(2T + 1)}{6} = 385.$$

The use of Equation (17.4) leads to DERepeat(rCS) = 0.15754 and Equation (17.2) gives DECluster = 1.01071, from which Equation (17.6) gives DECS = DECluster × DERepeat(rCS) = 0.15923. The final sample size from Equation (17.7) is NCS = 0.15923 × (10 + 1) × 4803.77 = 8413.94 or 8400. This implies, from Equation (17.8), that the number of clusters KCS = 0.15923 × 4803.77/60 = 12.75 or 14 with C = KCS × P = 154 cells for the same design of Figure 17.4b but with modified effect size stipulation.
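The arithmetic of Example 17.7 can be traced step by step as follows. This is a sketch only: Equation (9.6) is not reproduced in this chapter, so the function n0_binary assumes the familiar form using the pooled proportion under the null hypothesis and the separate proportions under the alternative, an assumption that reproduces the quoted N0 = 4803.77. All function and variable names are illustrative.

```python
import math
from statistics import NormalDist


def n0_binary(pi1, pi2, alpha=0.05, power=0.80):
    """Total size of a 1:1 individually randomised trial (assumed form of Eq. 9.6)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pbar = (pi1 + pi2) / 2
    root_null = math.sqrt(2 * pbar * (1 - pbar))
    root_alt = math.sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))
    per_group = (z_alpha * root_null + z_beta * root_alt) ** 2 / (pi2 - pi1) ** 2
    return 2 * per_group


def de_repeat_full(T, r):
    """Equation (17.4) for a full design with k = 1 and A = T arms."""
    A = T
    E = T * (T + 1) / 2
    F = G = T * (T + 1) * (2 * T + 1) / 6
    numerator = A**2 * (1 - r) * (1 + T * r)
    denominator = 4 * (A * E - G + (E**2 + T * A * E - T * G - A * F) * r)
    return numerator / denominator


pi_u, pi_h, m, T = 0.15, 0.18, 60, 10
n0 = n0_binary(pi_u, pi_h)

pbar = (pi_u + pi_h) / 2
sigma_cluster = 0.02 / 4                      # 14% to 16% taken as 4 sigma
rho = sigma_cluster**2 / (pbar * (1 - pbar))
omega = 1.0                                   # sigma^2(Time|Cluster) assumed 0

de_cluster = 1 + (m - 1) * rho                # Equation (17.2)
r_cs = m * rho * omega / de_cluster           # Equation (17.3)
de_cs = de_cluster * de_repeat_full(T, r_cs)  # Equation (17.6)
n_cs = de_cs * (1 + T) * n0                   # Equation (17.7)
k_cs = de_cs * n0 / m                         # Equation (17.8)

print(f"N0 = {n0:.2f}, rho = {rho:.7f}, r_CS = {r_cs:.6f}")
print(f"DE_CS = {de_cs:.5f}, N_CS = {n_cs:.1f}, K_CS = {k_cs:.2f}")
```

Because the code carries unrounded intermediate values, the final figures may differ from those in the text in the last decimal place.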

17.6 Closed cohort design

As indicated earlier, in a Closed Cohort (CC) SWD design, the same group of m participants is retained in a cluster throughout every period of the design. Thus, it is assumed that the participants in a given cluster are all identified at the beginning of the trial and are assessed at a series of predefined, discrete times following randomisation to eventually comprise a total of m (b + T) endpoint variable assessments.

Example 17.8

Closed cohort – free breakfast

In the CC-SWD of Figure 17.13, described by Mhurchu, Gorton, Turley, et al. (2013) for evaluating a Free Breakfast (FB) as the Intervention against No Free Breakfast (NoFB) as Control, schools form the clusters and the periods were school terms. As indicated in Figure 17.4c, the trial planned to recruit 400 children in K = 16 clusters, S = 4 steps, k = 3 or 4 clusters allocated at each step to FB whilst any remaining schools continued as NoFB. The design has P = 4 periods and C = K × P = 64 cells and anticipated m students per school. These same students would provide a unit of assessment for each of the four periods.

Term 1 (Demographics and assess):   Intervention   Control        Control        Control
Term 2 (Assess):                    Intervention   Intervention   Control        Control
Term 3 (Assess):                    Intervention   Intervention   Intervention   Control
Term 4 (Assess):                    Intervention   Intervention   Intervention   Intervention

Figure 17.13 CC-SWD for evaluating a free school breakfast programme involving 14 schools. Each arm (denoted Term) was allocated to either three or four schools. Source: Mhurchu, Gorton, Turley, et al. (2013). © 2013 BMJ Publishing Group Ltd

Example 17.9

Closed cohort – nurse-led medicines’ monitoring

The features of the CC-SWD conducted by Jordan, Gabe-Walters, Watkins, et al. (2015) are summarised in Figure 17.4d and their results are given in Figure 17.14. The trial concerned the monitoring of problems associated with antipsychotic, antidepressant or antiepileptic medication in care home residents with dementia. The trial implemented nurse-led medicines’ monitoring (NM) against usual care (U) and the same cohort of patients were assessed in each care home over the whole trial period. One endpoint concerned the number of problems associated with the individual patient in each of six periods from which the corresponding mean and standard deviation are tabulated in Figure 17.14. From these data, it can be seen that the number of patients from which documentation is obtained varied from period to period within Care Homes 4 and 5 so that (as appropriate) the Before-and-After intervention means obtained are first weighted by the number of observations within the cells. The Before–After differences in the means suggest an increase in the frequency of reported problems identified by the respective nurses in each


of the care homes. In fact a simple paired t-test using the mean difference (Diff = 7.746, SD = 3.132) from the five care homes gives t = 7.746/(3.132/ √5) = 5.53 with df = 4 which, from Table T4 of Student’s t-distribution, implies a p-value ≈ 0.005. More complete tables give p-value = 0.0052 and lead to the corresponding 95% CI as 3.9–11.6.

Care                          Period                           Before    After     Diff
Home           1       2       3       4       5       6         B         A       A–B
-----------------------------------------------------------------------------------------
1     n       10      10      10      10      10      10
      Mean  10.30   17.30   17.30   17.00   16.50   17.10      10.30     17.04     6.74
      SD     2.21    6.08    6.09    5.25    5.19    4.61

2     n        8       8       8       8       8       8
      Mean   8.13    5.38   17.75   12.75   12.88   14.38       6.76     14.44     7.68
      SD     3.72    3.58    6.94    5.90    6.31    5.73

3     n        5       5       5       5       5       5
      Mean   8.20    6.60    6.80   10.60    9.40   12.20       7.20     10.73     3.53
      SD     4.09    2.30    4.32    2.07    1.14    2.39

4     n       10      10      10       9       9       9
      Mean   9.10    6.20    7.00    4.56   18.33   19.11       6.77     18.72    11.95
      SD     3.81    2.49    2.63    2.30    7.57    6.79

5     n        8       9      10       9       9       9
      Mean   6.38    6.78    7.50    7.70    7.40   16.50       7.18     16.50     9.32
      SD     2.00    1.92    2.32    3.65    3.89    6.59
-----------------------------------------------------------------------------------------

Figure 17.14 Number of observations, mean and standard deviation of the number of problems recorded for all participants with dementia within successive cells within five care homes. Source: Based on Jordan, Gabe-Walters, Watkins, et al. (2015).
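The paired t-test of Example 17.9 uses only the summary values quoted in the text (mean difference 7.746, SD 3.132, five care homes). The sketch below assumes those quoted values and the tabulated critical value t = 2.776 for df = 4; the p-value uses the closed-form distribution function of Student's t that holds for exactly 4 degrees of freedom, so the helper t4_cdf is not valid for other df.

```python
import math


def t4_cdf(t):
    """Closed-form CDF of Student's t-distribution with exactly 4 df."""
    u = 1 + t * t / 4
    return 0.5 + (3 / 8) * (t / math.sqrt(u)) * (1 - t * t / (12 * u))


mean_diff, sd_diff, n_homes = 7.746, 3.132, 5   # values quoted in Example 17.9
se = sd_diff / math.sqrt(n_homes)
t_stat = mean_diff / se
p_value = 2 * (1 - t4_cdf(abs(t_stat)))

t_crit = 2.776                                  # tabulated t(0.975) for df = 4
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI {ci[0]:.1f} to {ci[1]:.1f}")
```

With only five care homes the analysis rests on four degrees of freedom, which is why the interval is wide despite the large t-statistic.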

17.6.1 Analysis

In a CC design, the same group of m participants is retained within each cluster at every step. Thus, although a total of m(b + T) assessments will be made, the number of participants remains at m per cluster. In this case, the model of Equation (17.1) for a continuous endpoint CC-SWD comparing NoI and YesI is extended by adding $\zeta^*_{itk}$ to account for the individual subject auto-correlation. Thus,

$$y_{itk} = \beta_0 + \beta_{\text{Treat}}\, x_{tk} + \beta_{\text{Time}}\, t + \gamma^*_k + \eta^*_{tk} + \zeta^*_{itk} + \varepsilon_{itk} \tag{17.9}$$

The additional random effects term, $\zeta^*_{itk}$, is assumed to have a normal distribution with mean 0 and variance $\sigma^2_{\text{Indiv|Cluster}}$.

17.6.2 Design effect

In a CC-SWD design, the individuals recruited within a cluster are then followed and assessed in every period of the design. As a consequence, the individual auto-correlation is also present and, as we have indicated, the variance term $\sigma^2_{\text{Indiv|Cluster}}$ then has to be considered. This impacts on the total variance, $\sigma^2_{\text{Total}}$, and contributes to the individual auto-correlation, θ, as indicated in Figure 17.15.

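Total variance:
$$\sigma^2_{\text{Total}} = \sigma^2_{\text{Random}} + \sigma^2_{\text{Indiv|Cluster}} + \sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}$$

Intra Class Correlation (ICC):
$$\rho = \left(\sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}\right)/\sigma^2_{\text{Total}}$$

Cluster auto-correlation:
$$\omega = \sigma^2_{\text{Cluster}}/\left(\sigma^2_{\text{Time|Cluster}} + \sigma^2_{\text{Cluster}}\right)$$

Individual auto-correlation:
$$\theta = \sigma^2_{\text{Indiv|Cluster}}/\left(\sigma^2_{\text{Random}} + \sigma^2_{\text{Indiv|Cluster}}\right)$$
The correlation between two assessments of the same individual at different times in a given cluster.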

Figure 17.15 Components of total variance for a CC-SWD design
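The three correlations of Figure 17.15 combine into the single quantity rCC, the closed cohort analogue of rCS, in which the numerator of Equation (17.3) gains the extra term (1 − ρ)θ. A minimal sketch follows; the four variance components and the value of m are hypothetical and chosen only so that the results are easy to check by hand.

```python
def cc_correlations(var_random, var_indiv, var_time_cluster, var_cluster):
    """rho, omega and theta as defined in Figure 17.15."""
    between = var_time_cluster + var_cluster
    total = var_random + var_indiv + between
    rho = between / total
    omega = var_cluster / between
    theta = var_indiv / (var_random + var_indiv)
    return rho, omega, theta


def r_cc(m, rho, omega, theta):
    """Correlation for a closed cohort: [m*rho*omega + (1 - rho)*theta] / [1 + (m - 1)*rho]."""
    return (m * rho * omega + (1 - rho) * theta) / (1 + (m - 1) * rho)


# Hypothetical variance components (assumed for illustration)
rho, omega, theta = cc_correlations(
    var_random=0.60, var_indiv=0.30, var_time_cluster=0.02, var_cluster=0.08
)
m = 20  # hypothetical number of participants per cluster
print(f"rho = {rho:.3f}, omega = {omega:.3f}, theta = {theta:.4f}")
print(f"r_CC = {r_cc(m, rho, omega, theta):.4f}")
```

When θ = 0 the expression reduces to rCS of Equation (17.3), so a cross-sectional design can be viewed as the special case with no individual auto-correlation.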

The specific CC-SWD design is affected by three items: (i) correlation of individuals within a cluster, (ii) the possible auto-correlation of an individual cluster with itself as it passes through the CC-SWD, and now (iii) the correlation of the individual participants with themselves as they pass through the CC-SWD. As a consequence, the corresponding DE needs to account for all three of these. In fact, the DE takes the same form as Equation (17.4) and all that is necessary to change is to redefine r by

$$r_{CC} = \frac{m\rho\omega + (1 - \rho)\theta}{1 + (m - 1)\rho}, \tag{17.10}$$

and this is denoted DERepeat(rCC). Some examples of the CC design matrix, D, with the corresponding values of DERepeat(rCC) are given in Figure 17.16.

Design with each cluster within each period to be assessed on τ occasions
No run-in (b = 0), A arms and S steps

For A = 2 arms and τ = 2 repeated measures per period, YesI allocated to one group, with one further step introducing YesI at the second stage to the other group, then T = 4, S = 2, P = 2 and

$$D\,(2 \times 4) = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}.$$

From equations (17.5), E = 6, F = 4² + 2² = 20 and G = 1² + 1² + 2² + 2² = 10 and using these in equation (17.4), gives

$$DE_{\text{Repeat}}(r_{CC}) = \frac{(1 - r_{CC})(1 + 4r_{CC})}{2(1 + 2r_{CC})}.$$

For A = 3 arms and τ = 2 repeated measures per period, YesI allocated to one group, with two further steps introducing YesI at each stage, then T = 6, S = 3, P = 3 and

$$D\,(3 \times 6) = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}.$$

In this case E = 12, F = 6² + 4² + 2² = 56 and G = 1² + 1² + 2² + 2² + 3² + 3² = 28 and using these in equation (17.4), gives

$$DE_{\text{Repeat}}(r_{CC}) = \frac{9(1 - r_{CC})(1 + 6r_{CC})}{32(1 + 3r_{CC})}.$$

Figure 17.16 Examples of the CC design matrix D, with the corresponding values of DERepeat(rCC)

Finally, for trial size estimation purposes, combining DECluster with DERepeat(rCC) gives the relevant DE as

$$DE_{CC} = DE_{\text{Cluster}} \times DE_{\text{Repeat}}(r_{CC}) \tag{17.11}$$

17.6.3 Trial size

If the outcome variable is continuous, the number of subjects required for a two-group individually randomised trial, N0, is given by Equation (9.5). However, as there will be the same individuals concerned in each period of a CC design, the planned total sample size is given by

$$N_{CC} = DE_{CC} \times N_0 \tag{17.12}$$

Example 17.10

Closed cohort – provision of school breakfast

The intention of the trial of Mhurchu, Gorton, Turley, et al. (2013), summarised in Figure 17.4c, was to evaluate the provision of a FB in schools as compared to NoFB. The authors stated, with respect to their sample size calculation: ‘The target sample size was 16 schools (4 per arm) and an average of 25 students per school, that is, 400 participants. Assuming an intra-cluster correlation coefficient of 0.05, this would provide at least 85% power, with a 2-sided significance level of 0.05, to detect a 10% absolute change in the proportion of students with a school attendance rate of 95% or higher’. As, on the whole, the same (m ≈ 25) children will be attending the school over the four terms, this is a CC design. The trial did not include a run-in (b = 0) so that Figure 17.4c indicates A = 4 arms, P = 4 periods, S = 4 steps, T = 4 post-randomisation assessments, and K = 16 schools. They further assumed ρ = 0.05 so that, from Equation (17.2), DECluster = [1 + (25 − 1) × 0.05] = 2.2 and Equation (17.10) suggests, with ω = 0.9, that

$$r_{CC} = \frac{25 \times 0.05 \times 0.9 + (1 - 0.05)\theta}{1 + (25 - 1) \times 0.05} = 0.5114 + 0.4318\theta.$$

Although the planning values for the proportion of children attending school are not given, the results imply that about 52% (pNoFB = 0.52) of children with NoFB would have satisfactory attendance. With an absolute change of 10%, this is anticipated to increase to pFB = 0.62 following the FB intervention. Thus with 2-sided α = 0.05 and β = 0.15, Equation (9.6) with φ = 1 gives N0 = 877.13. If θ = 0.4 is assumed, rCC = 0.6841 and using Equation (17.4) with r = rCC, DERepeat = 0.19937 which, when combined with DECluster as in Equation (17.11), gives DECC = 0.19937 × 2.2 = 0.4386. Finally, from Equation (17.12) the sample size is NCC = 0.4386 × 877.13 = 384.72 or (conveniently) 384. Since K = 16 is fixed, this implies the number per school can be smaller than planned as m = 384/16 = 24.
However, class sizes are determined by the schools involved rather than on statistical grounds so this lower figure might give some reassurance that the investigators will have sufficient school children (planned with 25 per class) for their purpose. The smaller the value of ω the larger the sample size, NCC. Thus, if ω = 0.7, NCC = 500 and this implies class size would have to be bigger (likely not to be possible) or (more likely) the number of schools increased.
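The calculations of Example 17.10 can be retraced as below. As before this is a sketch: the function n0_binary is an assumed form of Equation (9.6), which is not reproduced in this chapter, and the 4 × 4 design matrix is the full design with no run-in read from Figure 17.13 with one row per arm. The names used are illustrative.

```python
import math
from statistics import NormalDist


def n0_binary(pi1, pi2, alpha=0.05, power=0.85):
    """Total size of a 1:1 individually randomised trial (assumed form of Eq. 9.6)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pbar = (pi1 + pi2) / 2
    root_null = math.sqrt(2 * pbar * (1 - pbar))
    root_alt = math.sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))
    return 2 * (z_alpha * root_null + z_beta * root_alt) ** 2 / (pi2 - pi1) ** 2


def de_repeat(D, T, r):
    """Equations (17.4) and (17.5); rows of D are arms, columns are periods."""
    A = len(D)
    E = sum(sum(row) for row in D)
    F = sum(sum(row) ** 2 for row in D)
    G = sum(sum(col) ** 2 for col in zip(*D))
    numerator = A**2 * (1 - r) * (1 + T * r)
    denominator = 4 * (A * E - G + (E**2 + T * A * E - T * G - A * F) * r)
    return numerator / denominator


# Full design of Figure 17.13: intervention starts one term later in each arm
D = [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
m, rho, omega, theta, T = 25, 0.05, 0.9, 0.4, 4

n0 = n0_binary(0.52, 0.62)                                     # beta = 0.15
de_cluster = 1 + (m - 1) * rho                                 # Equation (17.2)
r_cc = (m * rho * omega + (1 - rho) * theta) / de_cluster      # Equation (17.10)
de_cc = de_cluster * de_repeat(D, T, r_cc)                     # Equation (17.11)
n_cc = de_cc * n0                                              # Equation (17.12)

print(f"N0 = {n0:.2f}, r_CC = {r_cc:.4f}, DE_CC = {de_cc:.4f}, N_CC = {n_cc:.2f}")
```

Re-running the sketch with other planning values of ω or θ shows how sensitive the final sample size is to these correlations, which are seldom known with any precision at the design stage.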

17.7 Practicalities

In general, randomised trials using SWDs tend to be used in situations that are termed ‘complex interventions.’ This complexity refers to the choice of the interventions to compare, the logistic difficulties associated with their delivery and the multiple options for their basic (statistical) design. It is clear that whatever design option is chosen, there are many parameters involved all of which need ‘planning’ values to be provided for


sample size calculation purposes. The impact on sample size of the numerous possible combinations of values for these parameters is likely to require extensive exploration. General guidance on the issues of concern is provided by the Medical Research Council (MRC) framework for complex interventions as described by, for example, Craig, Dieppe, Macintyre, et al. (2008), Moore, Audrey, Barker, et al. (2015) and Copas, Lewis, Thompson, et al. (2015). A case study concerned with developing a clinical decision support intervention, focusing on the assessment and management of pain in patients with dementia, is given by Dowding, Lichtner and Closs (2017). Also of great importance is the CONSORT extension to the guidance for reporting SWDs described by Hemming, Taljaard, McKenzie, et al. (2018) and amplified by Grant, Mayo-Wilson, Montgomery, et al. (2018) and Hemming, Taljaard and Grimshaw (2019).

PART III

Further Topics

CHAPTER 18

Genomic Targets

This chapter describes situations where there may be a priori evidence of how well a particular patient responds to a specific therapy that is dependent on their genetic make-up. In this context, what is termed a predictive marker is identified and patients with a particular condition can be tested for its presence. Thus, for example, they may be categorised as marker positive or negative prior to randomisation to a clinical trial pertinent to the condition concerned. In some situations, eligibility to the trial may include both marker positive and marker negative individuals into a design stratified by biomarker status. In other situations, the subsequent trial may be confined to only those who are biomarker positive in what is then termed an enrichment design. This chapter describes these specific design types and discusses consequences for determining an appropriate sample size and the approach to analysis.

18.1 Introduction

Clinical trials work on the principle of identifying average treatment effects in clearly defined subgroups of patients, and extending the conclusions made to the wider population of such individuals. Although the eligibility criteria are carefully defined, the patients eventually recruited into the trial may nevertheless be very heterogeneous in terms of their potential response to the treatment received. For example, many cancers of the same primary site and stage are diverse in terms of their pathogenesis, natural history and responsiveness to therapy. There is thus a risk that patients may be exposed to potentially toxic treatments from which they derive no benefit due to their particular genetic make-up. The study of how genetic variations can influence disease is broadly known as Genomic Medicine. Genomic Medicine leads to different challenges for the design of clinical trials. Many drugs, particularly in oncology, are increasingly developed for defined molecular targets. For some drugs, these targets are well understood and there is a compelling biological basis for focusing therapeutic development on the subset of patients whose

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.


tumours are characterised by the targets. For other drugs, there may be more uncertainty about the actual value of the target concerned.

18.2 Predictive markers

18.2.1 Definition

As patients with a particular disease may have different genetic make-ups, there is always the possibility that their response to treatment and/or the potential toxicities may depend on their specific genetic type. To identify such patients prospectively, and before treatment is instituted, requires an appropriate predictive classifier which is defined as: ‘A measurement made before treatment to select patients who are most likely to respond to, and thus benefit from, the treatment.’ In this context, Figure 18.1 states two necessary prerequisites prior to a randomised clinical trial being activated. Thus, at the development stage, the assumption is that a single classifier, and associated validated diagnostic test, is already available for use in the trial so that patients can be classified as biomarker positive (B+) or biomarker negative (B−) before their eligibility is confirmed and randomisation to Standard (S) or Test (T) is activated. Although the predictive biomarker has to be identified, this need not imply that it will always be of utility in the therapeutic situation in which it is being tested.

(i) Development of a predictive classifier (based on a molecular target) using preclinical and early phase clinical studies
(ii) Development of a validated test for the measurement of that classifier

Figure 18.1 Predictive classifier for randomised trials involving molecular targets. Source: Based on Simon (2005).

Example 18.1 Predictive classifier – epidermal growth factor receptor (EGFR)

The epidermal growth factor receptor (EGFR) was identified as a predictive marker in a trial conducted by Ravaud, Hawkins, Gardner, et al. (2008) in patients with advanced renal cell carcinoma. In this instance, patients were first categorised as EGFR 0, 1+, 2+ or 3+, with EGFR 3+ regarded as overexpressed and hence in our notation regarded as B+ and the remainder B−. Additionally, if the human epidermal growth factor receptor 2 (HER-2) is categorised in the same way, then those 1+ or greater are regarded as overexpressed. In the event, 57.9% of 416 patients with renal cancer were EGFR+ but only 5.4% of 389 were HER-2+.


Example 18.2 Potential biomarker – EGFR amplification and KRAS mutations

Garassino, Martelli, Broggini, et al. (2013) conducted a RCT of Erlotinib (E) versus Docetaxel (D) in patients with non-small-cell lung cancer in which biomarkers EGFR amplification and protein expression, and KRAS mutations were determined. Of 540 patients genotyped, 79 (15%) with mutated EGFR tumours were excluded from entry to the trial. At the first planned interim analysis of the trial, it was also concluded that the remaining biomarkers had no influence on the magnitude of treatment differences. However, the authors do report (their Figure 3) a comparison of treatments in the 218 patients with KRAS status assessed as Mutated (23%) or wild-type (77%) in which for OS the unadjusted HR = 0.78 is suggestive of an advantage to D.

18.2.2 Implications for trial size

Most of the considerations for trial size determination discussed in Chapter 9 will be of relevance here although, for illustration in this chapter, we will focus on time-to-event outcomes. The key equations necessary for determining the number of events (E) and hence subjects (N) required for estimating a HR are given in Equations (9.12) and (9.13), respectively. In RCTs described there, the research question is usually posed as: ‘Is T different from S?’ Hence, a 2-sided alternative hypothesis is implied, and so, a 2-sided test of the null hypothesis, H0: HR = 1, is anticipated. Also, randomisation to S and T is usually made on a 1 : 1 basis. Thus, we reproduce the equations here assuming a 2-sided test, 1-sided power and with an allocation ratio, φ = 1.

$$E = \left(\frac{1 + HR_{\text{Plan}}}{1 - HR_{\text{Plan}}}\right)^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2, \tag{18.1}$$

$$N = \frac{2E}{(1 - \pi_S) + (1 - \pi_T)}. \tag{18.2}$$

Here, nS = nT = N/2 subjects, respectively, will be included in each of the S and T intervention groups. Further, from Equations (9.10) or (9.11), it can be deduced for given planning values for πS or MS and HRPlan that

$$\pi_T = \exp\left(\log \pi_S \times HR_{\text{Plan}}\right) \quad \text{and} \quad M_T = M_S / HR_{\text{Plan}}. \tag{18.3}$$

In the context of this chapter, the question posed is sometimes: ‘Is T better than S?’ – so that a 1-sided alternative hypothesis is implied. In the case of a 1-sided test, z1 − α/2 is


replaced by z1 − α in Equation (18.1). If a 1-sided test is stipulated then, at the analysis stage, a 1-sided p-value and 1-sided CI of Equation (15.1) are required. However, should T appear worse than S (in our situation HR > 1) at this stage, then the associated test statistic will result in a p-value > 0.5, and hence, the null hypothesis will not be rejected. The more extreme in this direction, that is, the worse T, the larger this p-value will be. As we have just indicated, in determining trial size, one planning decision required is whether a 1- or 2-sided test size should be utilised in the calculations. For example, Garassino, Martelli, Broggini, et al. (2013), of Example 18.2 above, use a 2-sided test size of 0.05 and a 2-sided CI when comparing Docetaxel with Erlotinib in patients of a specific biomarker type who have non-small-cell lung cancer, whilst Stone, Mandrekar, Sanford, et al. (2017), of Example 18.4 below, in patients with acute myeloid leukaemia use a 1-sided 0.025 significance level for comparing Midostaurin with Placebo. In the latter case, one might argue that only an improvement over Placebo is of interest, whilst in the first, the concern is to establish which of the two drugs is superior. Hence, decisions on the use of 1- or 2-sided tests are likely to be very specific to the planned trial under consideration.

18.2.3 Role of strata

In traditional stratified designs, in some sense, it does not matter what the relative proportions of eligible patients are that fall into the (say) two biomarker groups, which are deemed prognostic, as the strata adjusted summary HR will take this into account. In this situation, the anticipated effect size used for planning is essentially an ‘average’ anticipation over the two diagnostic groups. Thus, it is anticipated that if T is of overall benefit over S, the benefit applies to both diagnostic groups irrespective of their relative prognosis.
In contrast, if it is anticipated that there will be a differential effect of T over S in the B− and B+ groups such that HR(−) ≠ HR(+), then the overall effect

$$\log HR_{\text{Over}} = p(-)\,\log HR(-) + p(+)\,\log HR(+) \tag{18.4}$$

may not be relevant. Further, the precision of the estimated HR within the distinct B− and B+ groups will depend on the relative proportion of patients, p(−) and p(+) = 1 − p(−), falling into these groups. Clearly, if either of p(−) or p(+) is small, then a sensible comparison of S versus T could not be made in the smaller group alone unless a very large difference is anticipated.
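Equation (18.4) shows how a benefit confined to one biomarker group is diluted when both groups contribute to the overall comparison. The sketch below uses purely hypothetical planning values, namely p(+) = 0.579, HR(+) = 0.69 and no effect in the B− group, HR(−) = 1; these are borrowed from the renal cell carcinoma examples only to fix ideas.

```python
import math


def overall_hr(p_pos, hr_pos, hr_neg):
    """Equation (18.4): weighted average of the log hazard ratios."""
    p_neg = 1 - p_pos
    log_hr = p_neg * math.log(hr_neg) + p_pos * math.log(hr_pos)
    return math.exp(log_hr)


# Hypothetical planning values (assumed for illustration)
hr_over = overall_hr(p_pos=0.579, hr_pos=0.69, hr_neg=1.0)
print(f"Overall HR = {hr_over:.4f}")

# The rarer the B+ group, the closer the overall HR moves towards 1
for p_pos in (0.8, 0.5, 0.2):
    print(p_pos, round(overall_hr(p_pos, 0.69, 1.0), 3))
```

Under these assumptions the overall HR is about 0.81, noticeably weaker than the HR = 0.69 anticipated within the B+ group alone.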

18.3 Enrichment design

18.3.1 Design

When there is compelling prior evidence that patients who are B− for the classifier will not benefit from the treatment, the biomarker enrichment design (BrichD) of Figure 18.2 is appropriate. This design restricts the eligibility by excluding B− patients from the randomisation to S or T. The conclusion drawn from a BrichD would be whether or not T is more effective than S for B+ patients. If it is beneficial, then future application of T would be restricted only to such patients.

[Flow diagram: Potential Subjects; all checked for biomarker status before randomisation. Biomarker B+: RANDOM allocation to Test (T) or Standard (S). Biomarker B–: Not eligible.]

Figure 18.2 Enrichment Design (BrichD) in which subjects are selected for randomisation provided they are diagnostic positive (B+) for a specific molecular target

The BrichD is thus appropriate for the situation where there is such a strong biological basis for believing that B− patients will not benefit from T that including them in the trial would raise important ethical concerns. Clearly, the design does not provide data on the effectiveness of T compared with S for B− patients. Consequently, unless there is this compelling evidence that T will indeed be ineffective in B− patients, BrichD would not be the design to choose. However, if B− patients were (perhaps mistakenly) to be included in the trial, and indeed T is truly ineffective over S in these patients, then the overall effect of T versus S would be diluted to an extent which would increase with increasing p(B−) and possibly disguise a worthwhile clinical benefit for those who are B+.

18.3.2 Trial size

The overall efficiency of the BrichD depends on the prevalence of B+ patients amongst the potential patients with the condition of concern. If this proportion is small, then this may adversely affect the numbers that it is possible to recruit to the trial within a reasonable time-frame. On the other hand, within the B+, a much larger expected difference (effect size) may be anticipated between the S and T arms than might be anticipated if an ‘overall’ effect size had been stipulated in which both B− and B+ patients were to be recruited. This ‘larger’ effect size would then be used to calculate the trial size using Equations (18.1), (18.2) and (18.3) as appropriate.


Example 18.3 Enrichment design – renal cell carcinoma

Suppose a confirmatory trial comparing Lapatinib (L) and hormone therapy (H) was to be conducted focusing on the EGFR 3+ patients similar to those included in the trial of Ravaud, Hawkins, Gardner et al. (2008), which suggested a HR = 0.69 with respect to overall survival (OS) in favour of L. Setting an effect size HR = 0.69, median OS with H as MH = 35 weeks, at which time πH = 0.5, a 1-sided test size of 5% and a power of 80%, the number of deaths to be observed from Equation (18.1) is E = 184. Further, from Equation (18.3), the values of ML = 51 weeks and πL = 0.62 are anticipated and, from Equation (18.2), the total number of B+ patients to recruit is 420. However, the earlier trial suggested 57.9% of renal carcinoma patients would be B+. This implies that a total of 420/0.579 = 726, or approximately N = 750, patients with advanced renal cell cancer would have to be screened to provide sufficient eligible patients for the confirmatory BrichD trial.
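The arithmetic of Example 18.3 can be retraced with a short Python sketch. Equations (18.1) to (18.3) are not reproduced in this section, so the forms used below are assumptions: a Freedman-type expression for the number of events, a proportional hazards relation for the survival proportions, and a total sample size rounded up to give equal arms. They are used here only because they reproduce the figures quoted in the example; the function names are illustrative.

```python
from math import ceil, exp, log
from statistics import NormalDist


def events_required(hr, alpha, power, sides=2):
    """Events needed to detect hazard ratio `hr` (assumed Freedman-type form)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / sides)
    z_beta = z(power)
    return ceil(((1 + hr) / (1 - hr)) ** 2 * (z_alpha + z_beta) ** 2)


def survival_on_test(pi_standard, hr):
    """Assumed proportional hazards relation: pi_T = exp(HR x log pi_S)."""
    return exp(hr * log(pi_standard))


def patients_required(events, pi_standard, pi_test):
    """Total patients for 1:1 allocation, rounded up to equal arm sizes (assumed form)."""
    return 2 * ceil(events / (2 - pi_standard - pi_test))


hr, pi_h, median_h = 0.69, 0.5, 35          # planning values of Example 18.3
events = events_required(hr, alpha=0.05, power=0.80, sides=1)
pi_l = survival_on_test(pi_h, hr)
median_l = median_h / hr                     # assumes exponential survival times
n_bplus = patients_required(events, pi_h, pi_l)
n_screen = ceil(n_bplus / 0.579)             # 57.9% of screened patients expected to be B+

print(events, round(median_l), round(pi_l, 2), n_bplus, n_screen)
```

Under these assumptions the sketch gives E = 184, ML = 51 weeks, πL = 0.62, 420 B+ patients and 726 patients to screen, in line with the example.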

18.3.3 Analysis

Once the trial is complete, a BrichD is no different with respect to the form of analysis than any other two-parallel group trial with the same type of endpoint. Nevertheless, since we wish to know whether T is better than S, a 1-sided test size is stipulated in the design and so, as we have indicated, a 1-sided p-value and a 1-sided CI of Equation (15.1) are required. A decision on whether further investigation of T is merited will depend on these values.

18.4 Biomarker-Stratified Designs

18.4.1 Design

In the context of a known predictive biomarker for outcome, a RCT comparing T and S may be stratified by the absence and presence of such a marker. Thus, when there is no compelling prior evidence from biological or clinical data that patients who test B− for the predictive classifier will not benefit from T, a Biomarker-Stratified Design (BstraD) can be used. In this case, all potential subjects are established as either B+ or B− using the predictive marker of (as yet) uncertain utility in this context. Nevertheless, irrespective of their biomarker status, all consenting subjects are included in the BstraD of Figure 18.3.

[Figure 18.3: eligible subjects are all classified before randomisation as Biomarker B+ or Biomarker B–; within each biomarker group, subjects are randomised to Test (T) or Standard (S).]

Figure 18.3 Randomised Biomarker-Stratified Design (BstraD)

Example 18.4 Biomarker-Stratified Design – acute myeloid leukaemia

Although not described as a BstraD, in the trial of Stone, Mandrekar, Sanford, et al. (2017) 3277 patients with acute myeloid leukaemia (AML) were first screened for the fms-related tyrosine kinase 3 gene (FLT3) mutation, which was present in 896 (27%) individuals. Amongst these, 717 underwent randomisation to receive either Midostaurin (M) or Placebo (P) with chemotherapy in a BstraD stratified by three subtypes of FLT3 mutation: point mutation in the tyrosine kinase domain (TKD) (162 patients), or internal tandem duplication (ITD) with either a high allelic ratio greater than 0.7 (214), or a low ratio of 0.05–0.7 (341). The authors concluded, using a 1-sided statistical test (p-value = 0.009), that M was superior to P (their Figure 2a) in terms of overall survival. They also presented the overall HR = 0.78 and the HRs, with the associated 2-sided 95% CIs (their Figure 2b), which are suggestive of an advantage to M over P in the three biomarker groups.

18.4.2 Prognostic influence of the biomarker

In the BstraD of Figure 18.3, the biomarker itself may or may not be prognostic for outcome per se. Figure 18.4a shows a possible outcome scenario when the biomarker is not prognostic for outcome but T brings considerable advantage over S in terms of OS. In this situation, all future patients (whether B− or B+) would be prescribed T. Thus, if this circumstance is anticipated at the design stage, all patients irrespective of biomarker status would be entered into the planned trial. In contrast, Figure 18.4b illustrates the situation where the biomarker is indeed prognostic and T is superior to S whether amongst the B− or B+ patients. Here too, all future patients (whether B− or B+) would be prescribed T and so, if this circumstance is anticipated at the design stage, all patients irrespective of biomarker status would again be entered into the planned trial.

Figure 18.4c shows a situation where, once again, the marker is not prognostic but T is now only effective in those patients who are B+. Consequently, only future B+ patients would be recommended to receive T. Thus, if this situation is anticipated at the design stage, then B− patients would provide no useful information and so would not be recruited to the trial. Finally, Figure 18.4d indicates that, although the biomarker is indeed prognostic, the superiority of T is confined to the B+ patients. Again, only future B+ patients would be recommended to receive T and, if this situation is anticipated at the design stage, B− patients would provide no useful information and so would not be recruited to the trial.

[Figure 18.4: four panels, (a) to (d), each plotting overall survival (0 to 1) against time from randomisation (0 to 15 months). Panels (a) 'Biomarker not Prognostic' and (b) 'Biomarker is Prognostic' are headed 'Test effective whether in B– or B+'; panels (c) 'Biomarker not Prognostic' and (d) 'Biomarker is Prognostic' are headed 'Test effective only in B+'.]

Figure 18.4 Schematic overall survival curves from a BstraD in which the biomarker is (a) not prognostic but T is effective in all patients, (b) prognostic and T is effective in all patients, (c) not prognostic but an advantage to T is confined only to those who are B+ and (d) prognostic but the advantage to T is confined to those who are B+

18.4.3 Analysis

Although the sample size calculation requires specification of the trial design, for trials using BstraD it is also essential that an appropriate analysis plan, stating how the predictive classifier will be accounted for, is clearly predefined in the protocol. Such plans can be classified in the three ways of Figure 18.5. Although we will not discuss option (c) further, details are provided by Simon (2008, p. 5889) and Simon (2012, p. 3035).

Biomarker with strong credentials:
(a) Anticipate T effective in all patients
(b) Test B– contingent on significance in B+
(c) Interaction Test

Figure 18.5 Strategies for analysis of a Biomarker-Stratified Design. Source: Adapted from Simon (2008); Simon (2012).

18.4.3.1 Anticipate T effective in B− and B+ patients

If the BstraD anticipated the outcomes suggested by Figure 18.4a and b, in which T is thought to be effective in both biomarker strata, then an analysis strategy is summarised in Figure 18.6.


(a) ANALYSIS PLAN
(i) Test the overall (combining B– and B+ patients) effect of T against S with a test size αOver = 0.03 (lower than the conventional α = 0.05).
(ii) If the overall difference is significant, calculate the strata adjusted HROverall, the associated CI and p-value.
(iii) If the overall difference is not significant (p-value > αOver) compare T and S within B+ only but at a test size αB+ = 0.05 – 0.03 = 0.02.
(iv) Whether the test in Point (iii) is significant or not, calculate HR(B+), the associated CI and the p-value obtained.

Figure 18.6 Anticipate T effective in all (B− and B+) patients: strategy for analysis. Source: Adapted from Simon (2008) and Simon (2012).

Example 18.5 Biomarker-Stratified Design – breast cancer

Although plasma vascular endothelial growth factor (pVEGF-A) was considered a promising prognostic biomarker, its importance in patients with HER2-negative metastatic breast cancer had not been established. Thus Miles, Cameron, Bondarenko, et al. (2017) assessed its value in an RCT comparing Bevacizumab plus Paclitaxel (BP) with Placebo plus Paclitaxel (PlP) in these patients. The investigators do not report an analysis following the pattern of Figure 18.6. However, it is clear that the overall comparison (their Figure 2A) of BP versus PlP with respect to progression-free survival (PFS) indicates a better outcome for BP and is suggestive of a p-value much less than αOver = 0.03. They also quote an adjusted HR = 0.68, presumably stratified by pVEGF-A status and other covariates, which was significant with p-value = 0.0007. Since the overall comparison is statistically significant, the analytical process of Figure 18.6 would end at Point (ii). However, their Figure 2B also compares PFS of the BP and PlP groups confined to those with biomarker pVEGF-Ahigh, with corresponding covariate adjusted HR = 0.64 and p-value = 0.0068. The comparison within the biomarker pVEGF-Alow is not reported.
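The sequence of decisions in Figure 18.6 can be written out as a small Python function. This is only a sketch of the decision logic: the p-values are assumed to have been obtained elsewhere (for example from stratified log-rank tests), the function name and return format are illustrative, and the second call uses hypothetical p-values.

```python
def analyse_overall_then_bplus(p_overall, p_bplus=None,
                               alpha_total=0.05, alpha_overall=0.03):
    """Follow the steps of Figure 18.6 and return (conclusion, items to report)."""
    # Remaining test size for the B+ comparison: 0.05 - 0.03 = 0.02 by default.
    alpha_bplus = round(alpha_total - alpha_overall, 10)
    if p_overall <= alpha_overall:
        # Points (i) and (ii): the overall test is significant, so stop here.
        return ("T effective overall",
                ["strata-adjusted HR(Overall), CI and p-value"])
    if p_bplus is None:
        raise ValueError("overall test not significant: a B+ p-value is needed")
    # Points (iii) and (iv): test within B+ only, at the remaining test size.
    report = ["HR(B+), CI and p-value"]
    if p_bplus <= alpha_bplus:
        return ("T effective in B+ only", report)
    return ("no benefit of T demonstrated", report)


# Overall p-value quoted in Example 18.5: the process ends at Point (ii).
example_18_5 = analyse_overall_then_bplus(0.0007)
# Hypothetical p-values in which only the B+ comparison is significant.
hypothetical = analyse_overall_then_bplus(0.04, p_bplus=0.015)
print(example_18_5[0])
print(hypothetical[0])
```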

18.4.3.2 Anticipate T effective in B− only if effective in B+

If the BstraD anticipated the outcomes suggested by Figure 18.4c and d where T is not anticipated to be effective in the B− patients unless it is effective in the B+ patients, then an analysis strategy is summarised in Figure 18.7. The purpose would not be to evaluate T overall, but only its effect within the B+ and B− marker groups individually.


(b) ANALYSIS PLAN
(i) Compare T against S in B+ patients at α = 0.05.
(ii) If the test in B+ is not significant, that is p-value > α, do not test in B–.
(iii) Whether the test in Point (i) is significant or not, calculate HR(B+), the associated CI and p-value.
(iv) However if the test in B+ is significant (p-value ≤ α), test in B– with the same test size α.
(v) Whether the test in Point (iv) is significant or not, calculate HR(B–), the associated CI and p-value.

Figure 18.7 Anticipate T effective in B− only if shown effective in B+: strategy for analysis. Source: Adapted from Simon (2008); Simon (2012).

Example 18.6 Biomarker-Stratified Design – renal cell carcinoma

As the associated events occur earlier, time-to-progression (TTP) rather than overall survival (OS) was used as the primary endpoint in the trial conducted by Ravaud, Hawkins, Gardner, et al. (2008). They compared L and H in all patients (their Figure 2A), as well as separately in those who over-express EGFR (EGFR 3+: biomarker B+) (Fig 2E) and in those who do not (EGFR 0, 1, 2: B−) (Fig 2C). Suppose the trial design anticipated that an effect within B− would only be expected if one was present in B+. Then, Figure 18.7 first stipulates a test within B+ (Fig 2E) which, for TTP, gives HR(+) = 0.76 and p-value = 0.063, with the latter greater than α = 0.05. Consequently, no test is conducted within the B− patients, as activity within that group was not anticipated by the design. However, repeating this process for OS (their Figure 2F) for the B+ patients gives HR(+) = 0.69 and p-value = 0.019 < 0.05. The subsequent analysis of Fig 2D, within the B−, gives HR = 1.11 with p-value = 0.56 and is clearly not significant! Thus, the analyses of TTP and OS are both suggestive of activity in the B+ but not the B− patients.
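The conditional testing plan of Figure 18.7 can be sketched in the same way, here applied to the p-values quoted in Example 18.6. The function name and the dictionary it returns are illustrative choices rather than part of any published method, and the p-values are taken as given.

```python
def analyse_bplus_then_bminus(p_bplus, p_bminus=None, alpha=0.05):
    """Follow the steps of Figure 18.7: B- is tested only if B+ is significant."""
    report = ["HR(B+), CI and p-value"]          # Point (iii): always reported
    if p_bplus > alpha:
        # Point (ii): B+ not significant, so no test is made in B-.
        return {"B+": "not significant", "B-": "not tested", "report": report}
    if p_bminus is None:
        raise ValueError("B+ test significant: a B- p-value is needed")
    # Points (iv) and (v): test B- at the same test size and report HR(B-).
    report.append("HR(B-), CI and p-value")
    b_minus = "significant" if p_bminus <= alpha else "not significant"
    return {"B+": "significant", "B-": b_minus, "report": report}


ttp = analyse_bplus_then_bminus(0.063)           # time-to-progression
os_result = analyse_bplus_then_bminus(0.019, 0.56)  # overall survival
print(ttp["B+"], "/", ttp["B-"])
print(os_result["B+"], "/", os_result["B-"])
```

For TTP the process stops after the B+ test, whereas for OS the B+ test is significant and the B− test is then carried out and found not significant.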

18.4.4 Trial size

In general, the proportions of patients with a particular disease condition that will fall into the biomarker positive and negative groups are unlikely to be equal. Clearly, if there is a major imbalance between the two proportions, then this may bring problems with respect to the ability to recruit sufficient patient numbers into the smaller group. Further, if the biomarker concerned is (very) prognostic, then the corresponding event rates will differ markedly between the biomarker groups and this may raise problems for analysing S versus T at the same calendar time in the two biomarker groups. Figure 18.8 shows for a BstraD two examples of how HR(B+) is required to decrease (anticipated to be more and more advantageous to T over S) as p(B−) and/or HR(B−) increase, in order to maintain an overall HR (HROver) at a fixed value. Thus, for example, to maintain HROver = 0.7 for p(B−) = 0.2 and HR(B−) = 1 (that is, T offers no advantage over S in B−) then, using Equation (18.4), a larger effect size (a value further from the null hypothesis) of HR(B+) = 0.640 would be implied. For p(B−) = 0.4 and HR(B−) = 1 a very extreme HR(B+) = 0.552 would be required.

[Figure 18.8: two panels, for HROver = 0.7 and HROver = 0.8, each plotting HR(B+) (0 to 1) against p(B–) (0 to 1), with separate curves for selected values of HR(B–).]

Figure 18.8 Decreasing value of HR(B+) (implying increasing effect size) required to compensate for the increasing HR(B−) towards unity (implying decreasing effect size) as p(B−) patients increases in a BstraD for situations when HROver is 0.7 and 0.8

18.4.4.1 Anticipate T effective in B− and B+ patients

Figure 18.9 summarises the approach to the sample size calculation if it is anticipated that T will be effective in all patients. In this situation, a test size for the overall comparison is suggested as αOver = 0.03 and that for within the B+ subgroup αB+ = 0.02. We emphasise that B+ is the subgroup in which T is thought to be of greater benefit.

(a) TRIAL SIZE
1. Required to provide adequate power for the Overall test at αOver = 0.03.
2. Also required to provide adequate power for the B+ test at αB+ = 0.05 – 0.03 = 0.02.

Figure 18.9 Anticipate T effective in all (B− and B+) patients: strategy for trial size determination. Source: Adapted from Simon (2008).
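The values of HR(B+) quoted above in connection with Figure 18.8 can be checked numerically. Equation (18.4) is not reproduced in this section, so the sketch below assumes that the overall log hazard ratio is the prevalence-weighted average of the stratum log hazard ratios, log HROver = p(B−) log HR(B−) + p(B+) log HR(B+). This assumed form is used only because it reproduces the quoted values; the function name is illustrative.

```python
from math import exp, log


def hr_bplus_required(hr_over, hr_bminus, p_bminus):
    """HR(B+) needed to maintain HR_Over, under the assumed weighted log-HR form."""
    p_bplus = 1 - p_bminus
    return exp((log(hr_over) - p_bminus * log(hr_bminus)) / p_bplus)


# T no better than S in B- (HR(B-) = 1) with HR_Over held at 0.7.
for p_bminus in (0.2, 0.4):
    print(p_bminus, round(hr_bplus_required(0.7, 1.0, p_bminus), 3))
```

With HR(B−) = 1 the B− patients contribute nothing to the overall effect, so the whole of log HROver has to be supplied by the B+ fraction, which is why the required HR(B+) becomes more extreme as p(B−) grows.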

Example 18.7 Biomarker-Stratified Design – survival time endpoint

Suppose a randomised trial is planned with a survival time outcome comparing S and T and it is anticipated that T will be equally effective in both B− and B+ patients. Assuming a planning HRPlan = 2/3, then with πS = 0.5 Equation (18.3) implies a planning πT = exp(log 0.5 × 0.6667) = 0.6299. Adopting the strategy of Figure 18.6 with a 2-sided test with αOver = 0.03 and 1-sided β = 0.2, Equations (18.1) and (18.2) give the number of events to be observed as E = E(−) + E(+) = 227 from N = 522 recruited subjects comprising both biomarker B− and B+ individuals. In these calculations, no note of the proportion of patients in each biomarker status is taken. If the trial size is determined as above and, at the close, the log-rank test comparing T with S in all patients leads to a pOver-value > 0.03 then, as this is not statistically significant, a test within only the B+ patients would be implemented. For this test, using a test size of αB+ = 0.02, to be significant when, for example, p(−) = p(+) = 0.5, implying N+ = 522/2 = 261, a value of HR(+) = 0.5470, more extreme than HRPlan = 0.6667, would have to be present.

18.4.4.2 Anticipate T effective in B− only if effective in B+

Figure 18.10 summarises the approach to the sample size calculation if it is anticipated that T will only be effective in B− if it is effective in B+. In this situation, there are two components involved, both with the (same) conventional test size: one for comparing the effect in B+ and one for the effect in B−.

(b) TRIAL SIZE
1. Required to provide adequate power for a test in B+ only at conventional α.
2. Also required to provide adequate power for a test in B– only at conventional α.

Figure 18.10 Anticipate T effective in B− only if shown effective in B+: strategy for trial size determination. Source: Adapted from Simon (2008).

Example 18.8 Biomarker-Stratified Design – survival time endpoint

Using the same planning values as Example 18.7, namely HR = 0.6667, πS = 0.5 and πT = 0.6299, but now adopting the strategy of Figure 18.7 with a 2-sided test of α = 0.05 and 1-sided β = 0.2, gives the number of events required from Equation (18.1) as E(+) = 197. Equation (18.2) then gives N(+) = 454 subjects comprising only biomarker B+ individuals. The same number of B− patients, that is, N(−) = N(+), is also recruited in what might be thought of essentially as a parallel trial of the same design as for B+ individuals. Thus, the planned total trial size anticipates E = 394 events from N = 908 patients. In the above calculations, no note is taken of the relative proportions of B− and B+ patients. However, the period required to recruit the necessary patients will depend critically on these proportions. For example, if p(B−) < p(B+) then the recruitment period for the B− group is likely to be longer to achieve the same target number as the B+ patients. Thus, once the design is implemented and the E(+) = 197 events within the B+ component have been observed, that component of the trial is closed, and an analysis of the associated data is conducted. If this is statistically significant, and the B− trial has not yet observed E(−) = 197 events, the B− component should continue to recruit until all 197 events have been observed, when a second analysis should be undertaken. As previously indicated, such a (two-component) design implies a maximum of E = 394 events may need to be observed from N = 908 patients. However, had the B+ trial not been statistically significant then the B− trial, if still ongoing, would be closed immediately, at which time fewer patients than the specified N(−) = 908/2 = 454 might have been recruited.
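The trial sizes of Examples 18.7 and 18.8 can be compared side by side in a few lines of Python. As Equations (18.1) and (18.2) are not shown in this section, the sketch assumes a Freedman-type expression for the number of events and a total sample size rounded up to give equal arms; these assumed forms, and the function names, are illustrative and are used because they reproduce the quoted figures.

```python
from math import ceil
from statistics import NormalDist


def events_required(hr, alpha, beta):
    """Events for a 2-sided test size alpha and power 1 - beta (assumed Freedman form)."""
    z = NormalDist().inv_cdf
    return ceil(((1 + hr) / (1 - hr)) ** 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2)


def patients_required(events, pi_s, pi_t):
    """Total patients for 1:1 allocation, rounded up to equal arms (assumed form)."""
    return 2 * ceil(events / (2 - pi_s - pi_t))


hr_plan, pi_s = 2 / 3, 0.5
pi_t = pi_s ** hr_plan                 # same as exp(HR x log pi_S)

# Example 18.7: one overall comparison at alpha_Over = 0.03.
e_over = events_required(hr_plan, alpha=0.03, beta=0.2)
n_over = patients_required(e_over, pi_s, pi_t)

# Example 18.8: separate B+ and B- components, each at alpha = 0.05.
e_plus = events_required(hr_plan, alpha=0.05, beta=0.2)
n_plus = patients_required(e_plus, pi_s, pi_t)
e_total, n_total = 2 * e_plus, 2 * n_plus

print(round(pi_t, 4), e_over, n_over, e_plus, n_plus, e_total, n_total)
```

The comparison makes the cost of the two-component strategy explicit: at most 394 events from 908 patients, against 227 events from 522 patients when a single overall comparison is planned.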

18.4.5 Practical considerations

In general, there will be rather complex situations to consider when designing these trials. These will depend on the views of the investigating team with respect to whether or not T will be expected to bring benefit to one or both biomarker groups, the strength of the prognostic influence of the biomarker and the relative proportions of the respective markers amongst the eligible patients it is planned to recruit. Clearly, if either p(B−) or p(B+) is very small, then use of some of the designs would not be considered. Suppose a BstraD is planned with HROver = 0.7 with 2-sided α = 0.03 and β = 0.2. The first step is to use Equation (18.1), which suggests EOver = 292 events are required. Setting πS = 0.5 in (18.3) gives πT = exp(log 0.5 × 0.7) = 0.6156 which, with use of (18.2), then gives the number of patients to recruit as NOver = 662. If, as in Figure 18.4c and d, T is known to be only effective in B+, then B− patients would not be recruited to the trial. Thus, the 292 events would be required from 662 B+ patients. On the other hand, if it is anticipated that T may be effective overall but less so in B− than in B+ then, for a given HROver, this implies HR(−) > HR(+). However, if B− is adversely prognostic, then events from patients receiving S in this biomarker group will accrue at a greater rate than from S patients who are B+.


Suppose that HR(−) = k × HR(+), then with HROver = 0.7, k = 1.1 and p(B−) = 0.2, Equation (18.4) leads to HR(+) = 0.6868 and HR(−) = 0.7555 with corresponding E(+) = 264 and E(−) = 468. Thus, in total, E = E(−) + E(+) = 732 events. If instead k = 1.25, then HR(+) = 0.6694 and HR(−) = 0.8368 with E(+) = 232, E(−) = 1150, and E = 1382. In these two, and similar, scenarios, if separate trials within the two biomarker groups were to be conducted, then the B+ trial would be smaller in size than the overall design size as HR(+) is smaller (further from the null, HR0 = 1) than HROver. The size of the B− group trial would be larger as HR(−) is larger than HROver and therefore closer to the null. In such a case, a practical strategy might be to assume k = 1.1 and recruit both B− and B+ patients until O(+) = EOver = 292 events are observed just within B+. At that stage, this will (hopefully) answer the B+ question. If B− is very adversely prognostic, and hence experiences a correspondingly greater event rate than B+, then the accumulated events in this group, O(−), may be sufficient to indicate the relative merit of T over S within that biomarker group. However, the greater event rate may be offset by the group comprising relatively fewer patients if, for example, p(−) = 0.2. In that case, the B− component of the trial may have to continue recruitment.
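The two scenarios above (k = 1.1 and k = 1.25) can be reproduced with the following sketch. It rests on two assumptions, since Equations (18.1) and (18.4) are not shown in this section: a Freedman-type expression for the number of events with 2-sided α = 0.03 and β = 0.2, and an overall log hazard ratio equal to the prevalence-weighted average of the stratum log hazard ratios. Function and variable names are illustrative.

```python
from math import ceil, exp, log
from statistics import NormalDist


def events_required(hr, alpha=0.03, beta=0.2):
    """Events for a 2-sided test size alpha and power 1 - beta (assumed Freedman form)."""
    z = NormalDist().inv_cdf
    return ceil(((1 + hr) / (1 - hr)) ** 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2)


def stratum_hrs(hr_over, k, p_bminus):
    """Solve log HR_Over = p(B-) log HR(-) + p(B+) log HR(+) with HR(-) = k x HR(+)."""
    hr_plus = exp(log(hr_over) - p_bminus * log(k))
    return hr_plus, k * hr_plus


e_over = events_required(0.7)          # overall design, HR_Over = 0.7
scenarios = {}
for k in (1.1, 1.25):
    hr_plus, hr_minus = stratum_hrs(0.7, k, p_bminus=0.2)
    e_plus, e_minus = events_required(hr_plus), events_required(hr_minus)
    scenarios[k] = (round(hr_plus, 4), round(hr_minus, 4),
                    e_plus, e_minus, e_plus + e_minus)
    print(k, scenarios[k])
print(e_over)
```

The steep rise in E(−) as k increases reflects how quickly the required number of events grows when the hazard ratio approaches the null value of 1.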

18.5 Adaptive threshold designs

18.5.1 Design

Although previous designs have assumed that a biomarker targeted for a particular drug is available whose biology is well understood, there will be many situations in which the biology of the potential target is insufficiently understood at the time that a RCT is initiated. For these circumstances, Jiang, Freidlin and Simon (2007) propose an adaptive threshold design (ATD). In this design, a single predictive marker, with the potential to divide patients into those who will, and those who will not, benefit from the T under investigation, is available but no threshold to separate B− from B+ is predefined. To do this, it is required that the biomarker is either a continuous variable, and so can be dichotomised at any value, or a categorical variable with (preferably) several categories to allow dichotomisation between any chosen levels. When implementing the ATD, specimens are collected from all patients at trial entry but, since the predictive test is not yet established, the results from the samples will not modify patient eligibility criteria. Thus, irrespective of the biomarker outcome, all patients are randomised to either S or T.

Example 18.9 Adaptive threshold design – unconfirmed predictive classifier

As we indicated in Example 18.1, the EGFR expression of eligible renal cell carcinoma patients entered into the trial conducted by Ravaud, Hawkins, Gardner, et al. (2008) was determined so that two biomarker groups could be established. Suppose a RCT is to be conducted, the results from which will determine how the biomarker groups are to be defined. In this example, since EGFR is recorded in four ordered categories, 0, 1+, 2+ and 3+, three possible dichotomies (0 versus 1+, 2+, 3+), (0, 1+ versus 2+, 3+) and (0, 1+, 2+ versus 3+) are available to investigate in order to determine the cut for B− versus B+. Although not the objective of the trial, of the 416 patients in whom EGFR was assessed 57.9% were EGFR3+ so a practical dichotomy (as used by the investigators) in this case would be (0, 1+, 2+ versus 3+), although this need not imply this is the most discriminatory cut with this measure. However, it would establish biomarker groups of similar sizes.
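The candidate cuts of Example 18.9 follow directly from the ordering of the categories: a marker with m ordered levels allows m − 1 dichotomies. A minimal sketch, with an illustrative function name, is:

```python
def dichotomies(ordered_levels):
    """All splits of ordered categories into a lower (B-) and an upper (B+) group."""
    return [(ordered_levels[:i], ordered_levels[i:])
            for i in range(1, len(ordered_levels))]


egfr_levels = ["0", "1+", "2+", "3+"]
cuts = dichotomies(egfr_levels)
for lower, upper in cuts:
    print(", ".join(lower), "versus", ", ".join(upper))
```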

18.5.2 Analysis

The analysis of an ATD trial is conducted step-by-step in stages, the number of which is dependent on the outcome of each stage. The process begins by comparing the outcomes for all patients receiving T with all S patients. If this difference is significant at a prespecified significance level, α1 < αOver, T is considered effective for the eligible population as a whole and the analysis terminates without investigation of the value of the (potential) classifier. If the first test is not significant, that is the p-value > α1, then it is presumed that the classifier may be important. This result instigates an investigation to find a cut-point to define ‘good’ (those in which T brings benefit over S) and ‘poor’ (those in which T brings no benefit over S) patients. The second stage involves finding the cut-point b∗ for which the difference in outcome of T versus S (the treatment effect) is maximised when the comparison is restricted to patients with predictive test scores above that cut-point. The second stage test is performed using a smaller significance threshold of α2 = αOver − α1. If significant, then T is considered effective for the subset of patients with a biomarker value above the cut-point at which the maximum treatment effect occurred.

18.5.3 Trial size

The principal aim of the trial is to compare S and T, and so the anticipated difference between them will form the basis of the effect size (the HR) chosen for the design. Then, with a prespecified test size, α, and power 1 − β, the sample size can be determined. However, the steps indicated in the approach to analysis described above may include multiple stages. This includes the overall test and further tests for determining the ‘best’ marker dichotomy. Consequently, discussion of an appropriate value for α may suggest that a larger test size (that is, a less stringent threshold for statistical significance) should be set. Thus, αOver = 0.1 or 0.2 rather than the usual 0.05 might be chosen and so a relatively smaller sample size would result.
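The staged analysis of Section 18.5.2 can be outlined as a decision rule. The sketch below is a simplification under stated assumptions: the p-values are supplied by the user, the second-stage p-value is assumed to have been already adjusted for the search over cut-points, and the split α1 = 0.04 of αOver = 0.05, the function name and the hazard ratios in the usage lines are all hypothetical values chosen for illustration.

```python
def atd_analysis(p_overall, hr_above_cut, p_best_cut,
                 alpha_over=0.05, alpha_1=0.04):
    """Two-stage ATD decision: overall test at alpha_1, then best cut at alpha_2."""
    if not 0 < alpha_1 < alpha_over:
        raise ValueError("alpha_1 must lie strictly between 0 and alpha_over")
    alpha_2 = alpha_over - alpha_1
    if p_overall <= alpha_1:
        # Stage 1 significant: the classifier is not investigated.
        return {"conclusion": "T effective in all eligible patients",
                "cut_point": None}
    # Stage 2: the cut-point with the largest benefit to T (smallest HR above it).
    best_cut = min(hr_above_cut, key=hr_above_cut.get)
    if p_best_cut <= alpha_2:
        conclusion = "T effective above the cut-point"
    else:
        conclusion = "no benefit of T demonstrated"
    return {"conclusion": conclusion, "cut_point": best_cut}


# Hypothetical HRs (T versus S) among patients at or above each candidate cut.
hypothetical_hrs = {"1+": 0.95, "2+": 0.85, "3+": 0.70}
result = atd_analysis(p_overall=0.20, hr_above_cut=hypothetical_hrs,
                      p_best_cut=0.004)
print(result)
```

Keeping the two thresholds tied together through α2 = αOver − α1 is what limits the overall test size to αOver, however the second stage turns out.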

Example 18.10 Adaptive threshold design – non-small-cell lung cancer

Kim, Herbst, Wistuba, et al. (2011) describe the results of a completed prospective, biopsy-mandated, biomarker-based, ATD trial in 255 pretreated chemorefractory non-small-cell lung cancer patients. They state:

The initial accrual goal for this trial was 250 randomised patients to achieve a sample size of 200 evaluable patients with complete marker profiles, which would allow an 80% power, with a 20% type I error rate, to identify effective treatments within each marker group. A high type I error rate prevented missing any potentially effective treatments that could be confirmed in larger, future studies …

The first 97 patients were randomised on a 1 : 1 : 1 : 1 basis to Erlotinib (E), Vandetanib (V), Erlotinib plus Bexarotene (EB) or Sorafenib (So). Following this, the subsequent 158 patients recruited were switched to a (so-called) Bayesian adaptive randomisation, which bases the allocation to treatment on accumulating data within the trial, thereby allowing a greater proportion to receive what appear to be the more effective therapies. Thus, as shown in Figure 18.11, the ratios actually randomised to treatment were broadly similar in the first stage, but very disparate in the adaptive stage with, for example, almost half randomised to So and only 10% to EB. Amongst the 5 biomarker groups utilised with this design, the most favourable subgroup appears to be marker KRAS/BRAF receiving So with a disease-free control rate of 79%.

                                  Treatment allocated
                                  E          V          EB         So           Total

Equal randomisation stage
  Number of patients              25         23         21         28           97
  Ratio randomised                0.258      0.237      0.216      0.289

Adaptive randomisation stage
  Number of patients              34         31         16         77           158
  Ratio randomised                0.215      0.196      0.101      0.487

One of the five biomarker groups: response rate in evaluable patients
  KRAS/BRAF                       1/7 (14%)  0/3 (0%)   1/3 (33%)  11/14 (79%)  13/27 (48%)

Figure 18.11 Number of chemorefractory non-small-cell lung cancer patients recruited to the BATTLE trial for the equal and adaptive randomisation stages by treatment allocated, together with the response to KRAS/BRAF – one of the five biomarker groups studied. Source: Based on Kim, Herbst, Wistuba, et al. (2011).
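The ratios and percentages of Figure 18.11 follow directly from the patient counts, as the short sketch below shows. The counts are those given in the figure; the dictionary layout and variable names are illustrative.

```python
allocated = {
    "equal":    {"E": 25, "V": 23, "EB": 21, "So": 28},
    "adaptive": {"E": 34, "V": 31, "EB": 16, "So": 77},
}

# Proportion randomised to each treatment within each stage.
ratios = {}
for stage, counts in allocated.items():
    total = sum(counts.values())
    ratios[stage] = {arm: round(n / total, 3) for arm, n in counts.items()}
    print(stage, total, ratios[stage])

# KRAS/BRAF group: (responders, evaluable patients) by treatment.
kras_braf = {"E": (1, 7), "V": (0, 3), "EB": (1, 3), "So": (11, 14)}
responders = sum(r for r, _ in kras_braf.values())
evaluable = sum(n for _, n in kras_braf.values())
percent = {arm: round(100 * r / n) for arm, (r, n) in kras_braf.items()}
print(percent, f"{responders}/{evaluable}", round(100 * responders / evaluable))
```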


18.5.4 Practicalities

In general terms, any RCT contemplated, of whatever size and complexity, requires careful planning and usually extensive logistical support in order to be completed successfully. In the context of this chapter, the entire process may be further complicated by requiring knowledge of the appropriate biomarkers, their timely assessment during the trial process and the ability to act on the consequences of such information. Thus, there must be very close links between the clinical, laboratory and data collection teams so that decisions, for example to randomise using an adaptive approach, can be implemented. Since adaptive randomisation requires knowledge of the ‘response’ of earlier recruited patients for it to be activated, ‘fast-track’ knowledge of their response must be available at the randomisation centre. Apart from the complexities of requiring biomarker information or obtaining patient response information during the conduct of the trial, there are the statistical complexities concerned with, for example, estimating the sample size required, the logistics of setting up the processes to facilitate the (possible) adaptive randomisation and details of interim looks at the data. One such issue may be when to switch from equal randomisation with the first patients to adaptive randomisation at the later stage. A further concern is the ‘black-box’ nature of the statistical planning processes. Some of the associated statistical methodology is complicated and far from easy to explain to the clinical teams concerned. This makes it difficult for the writing team to satisfy a CONSORT requirement which stipulates that sample size justification needs to be described clearly in the subsequent publication. These concerns essentially restrict the use of ATD and other designs of this chapter to well established and extremely sophisticated clinical trial groups.

CHAPTER 19

Feasibility and Pilot Studies

Conducting randomised clinical trials (RCTs) of whatever design and complexity will always involve the investigating teams planning them very carefully. In general, the preliminary work that needs to be done before recruiting the first subject is considerable. Among the many considerations, the practicability of delivering the interventions and the determination of the appropriate number of subjects required are fundamental. In broad terms, a feasibility study is concerned with practicability and a pilot study with subject numbers, although the terms are often regarded as interchangeable. This chapter describes the general nature of feasibility studies and gives some examples of their role. One type of pilot study, the External-Pilot study, provides (usually statistical) information upon which a definitive (Main) trial will be designed. In contrast, an Internal-Pilot is designed mainly to enable a reassessment of the sample size within an ongoing Main randomised trial. Issues of determining the size of a Pilot study are discussed and, in the case of an External-Pilot, methods of determining the optimum total (External-Pilot plus Main) trial size are included.

19.1 Introduction

As a preliminary to designing a RCT, there will usually be a protracted period in which the research question is identified, associated work reviewed, preliminary options for the chosen design investigated, sample sizes determined and the protocol written and reviewed before the definitive (we term this Main) trial can commence. If, at this stage, information is sparse on one or more of the planning features then it may be appropriate to use this planning interval to firm up on some of these aspects. This might include conducting a preliminary investigation to refine details on, for example, the precise nature of the interventions to be compared or be concerned with more mundane but important logistical issues. Such a study is generally termed a ‘feasibility study’ which is described by Eldridge, Lancaster, Campbell, et al. (2016) as an over-arching concept which may concern refining qualitative and quantitative aspects pertinent to design and trial operation. The term, ‘pilot study,’ comes under the same ‘feasibility’ umbrella. For the purposes of this chapter, we will regard Feasibility studies as being confined to issues concerned with logistical aspects associated with trial planning and eventual conduct whereas Pilot studies are concerned with issues relating to determining sample size in some way. In practice, both Feasibility and Pilot studies, or a single study of both types, may be conducted during the processes involved in developing and conducting a particular Main trial. Consequently, characterising a particular preliminary study as feasibility or pilot is not always straightforward, as will be clear from some of the examples of this chapter.

19.2 Feasibility studies

In circumstances where there is a main trial being planned, and there is extensive experience both of the dynamics of the participant investigation groups and of the peculiarities of the therapeutic area, a formal feasibility study may not be necessary. However, in most situations, new issues or challenges are likely to arise which will be examined during the planning process so that some aspects concerned with feasibility often need to be evaluated. Feasibility studies are those in which investigators attempt to answer questions about whether some element of the future Main trial can be done but do not necessarily implement the interventions to be evaluated or other processes to be undertaken in a future trial. Examples of feasibility studies might include reviews of patient records to ascertain levels of patient availability or interviews with clinicians to determine their willingness to contribute patients to a proposed trial.

Example 19.1 Feasibility – patient informed consent
Sosnowski, Mitchell, White, et al. (2018) report a feasibility study concerned with assessing the quality of life in intensive care unit survivors. Their study design randomised thirty adult mechanically ventilated participants recruited from a single centre equally to receive either the ABCDE bundle (Test, T) or standard routine management (Standard, S). The ABCDE bundle is a strategy developed to assist in the implementation of guidelines with respect to managing pain, agitation and delirium in such patients. One feasibility concern of the investigators was whether a sufficient patient informed consent rate could be anticipated for the planned Main trial. In the event, they established a satisfactory consent rate of 81% from 37 eligible patients. As a consequence of this study showing that many subjects experienced delirium during the ventilation phase, a scripted regime to prevent delirium was required for the Main trial.


Example 19.2 Feasibility – refining the interventions
One aspect of the (so-called) feasibility study of Vigneault, Morton, Parulekar, et al. (2018) was to examine the quality assurance process that assesses compliance with protocol-specified radiotherapy treatment parameters in patients with prostate cancer receiving either image-guided radiotherapy (IGRT) alone or with the addition of a high dose rate (HDR) brachytherapy boost. The investigators reported: ‘For the 29 patients receiving IGRT treatment, 14 cases were reported with minor deviation and three with major deviation. For patients on IGRT + HDR, 18 cases were reported with minor deviation and two with major deviation. Both of these major deviations occurred in the IGR component ….’

However, the investigators demonstrated the feasibility of the randomisation between the treatments, good compliance and adherence to quality assurance standards.

19.3 External-pilot studies

19.3.1 Aims
In broad terms, these are intended to inform the investigators on sample size issues in relation to the Main RCT. In our context, Main refers to the planned trial whose objective is to answer the research questions posed by the investigating team. The purpose of an External-Pilot study is to influence the planning before the Main trial is initiated, with the primary objective of refining, for example, the anticipated response rate with S, the proposed effect size or the standard deviation (SD), with a view to better informing the sample size calculation process. At the same time, another pilot issue may be the patients’ willingness to participate in the proposed Main trial, while feasibility aspects may reflect concerns with respect to reviewing the procedures and patient acceptability of the proposed T intervention.

19.3.2 Single-arm design
As single-arm implies, these are not comparative studies but may, in relevant circumstances, examine both S and T options in parallel but in a nonrandomised structure. In some situations, it may be relevant to examine T alone. In others, a single-arm pilot study of S may help to establish a reasonable estimate of (say) the recruitment rate that can be achieved or to obtain a more reliable estimate of the SD of the intended outcome measure so that a more credible estimate of participant numbers for the Main trial can be obtained.

Example 19.3 Single-group external-pilot – varicose veins
Dwyer, Baranowski, Mayer and Gabriele (2018) describe a ‘single-arm pilot study … to determine the feasibility of recruitment and data collection …’ in patients with chronic venous disease. The affected limbs were treated with LivRelief varicose veins cream, a Natural Health product. The authors enrolled 32 subjects and concluded: ‘Recruitment and data collection targets of at least 70% were established ….’

Although not immediately related to the eventual sample size required, the rate of recruitment provides important information with respect to the anticipated duration of the Main trial and thereby indicates its practicability. The authors also commented that the cream could improve some clinical symptoms and that a randomised placebo-controlled follow-up trial was warranted.

19.3.3 Randomised design
Although an External-Pilot randomised trial is one which mimics (although on a smaller scale) the design of the Main trial, it does not aim to establish the superiority or otherwise of T over S interventions but rather to try out aspects of the proposed Main trial. Neither are the data from the External-Pilot combined with those of the eventual Main trial. Randomised pilot studies include those that, for the most part, reflect the design of a future Main trial but, if necessary due to remaining uncertainty, may involve trying out alternative strategies, for example, collecting an outcome variable via telephone for some participants and online for others. Although a 1 : 1 randomisation is usual, this may not necessarily be envisaged for the eventual Main trial. Neither is information from the External-Pilot expected to lead to dramatic changes being made to the design features of the Main trial being planned.

It is desirable that the External-Pilot includes participants who are representative of those intended for the Main trial to follow. However, it may happen that pilots are conducted in specialist centres, where the participants are tightly controlled and so less variation is present than in the more general medical setting anticipated for the Main trial. This may cause some difficulty in assessing appropriate values for the component parameters determining the anticipated effect size for the Main trial. Clearly, the Main trial will commence sometime after the completion of the External-Pilot.


Example 19.4 Determine the SD – major depressive disorder
The randomised study reported by Sugg, Richards and Frost (2018) compared Treatment as Usual (TAU) with Morita (M) therapy with TAU (M-TAU) in 68 participants with depression. The M component is a Japanese psychotherapy informed by Zen Buddhist principles whereas TAU is that which is given by the patient’s General Practitioner. The main outcome measure evaluated was the severity of depressive symptoms (PHQ-9) scale. The mean (SD) PHQ-9 scores at 4 months post-randomisation were 12.4 (5.7) and 8.4 (6.5) for TAU and M-TAU based on 33 and 34 participants respectively, giving a pooled estimate of SDPool = 6.1 units.
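The pooled SD quoted in Example 19.4 can be checked with a few lines of Python. This is only a sketch: it assumes the usual pooled-variance formula, in which each group variance is weighted by its degrees of freedom, and the function name `pooled_sd` is introduced here purely for illustration.

```python
import math

def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled SD of two groups, weighting each variance by its degrees of freedom."""
    df_a, df_b = n_a - 1, n_b - 1
    pooled_var = (df_a * sd_a ** 2 + df_b * sd_b ** 2) / (df_a + df_b)
    return math.sqrt(pooled_var)

# SDs and group sizes quoted in Example 19.4 (PHQ-9 at 4 months).
sd_pool = pooled_sd(5.7, 33, 6.5, 34)
df_pilot = 33 + 34 - 2  # degrees of freedom with which the pooled SD is estimated
print(f"SD_pool = {sd_pool:.1f} on {df_pilot} df")
```

The degrees of freedom are worth recording alongside the SD, since the adjustment methods for the Main trial size depend on them.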

19.3.4 Sample size

19.3.4.1 General considerations
As we have discussed in Chapter 9, when establishing the number of subjects to recruit in a RCT designed to detect superiority with (say) a continuous outcome measure, a range of options for the anticipated difference, δPlan, between the S and T arms will be considered. In addition, the extent of the anticipated variability between subjects within each intervention group needs to be specified by an SD, σPlan. Once these two components are identified, together they determine the standardised effect size:

$$\Delta_{\mathrm{Plan}} = \frac{\delta_{\mathrm{Plan}}}{\sigma_{\mathrm{Plan}}} \tag{19.1}$$

Nevertheless, it is well recognised that in general there will be a degree of uncertainty with respect to the components of ΔPlan and this will be reflected in the uncertainty with respect to the sample size, N, chosen for the Main trial under consideration. In general, if endpoints are binary, ordered categorical or time-to-event variables rather than continuous, the corresponding RCTs will tend to be larger for the same standardised effect sizes. Consequently, the rules indicated below are likely to provide lower limits to the participant numbers recommended for External-Pilot studies in these situations.

19.3.4.2 Flat rule-of-thumb
In the very preliminary stages of planning Main trials, little may be known of the actual value of the SD to be used for σPlan. Thus, with this uncertainty in mind, sample size possibilities for a two equal-group comparative External-Pilot study, NPilot, have been suggested to provide information on which to base an estimate of σPlan. These are summarised in Figure 19.1 and indicate a wide range of values from 20 to more than 70 subjects. The term ‘Flat’ indicates that these suggestions take no account of the ultimate standardised effect size, ΔPlan, of Equation (19.1). Thus, the External-Pilot sample size is fixed no matter how large the subsequent Main trial is intended to be.

Reference                                  NPilot
Birkett and Day (1994)                     20
Julious (2005)                             24
Kieser and Wassmer (1996)                  20 to 40
Sim and Lewis (2011)                       ≥56
Browne (1995)                              60*
Teare, Dimairo, Shephard, et al. (2014)    ≥70

*Actually recommended 30 for a single-arm study.
The total trial sizes here assume that full pertinent information will be obtained from all the subjects recruited.

Figure 19.1 Flat rule-of-thumb for the total sample size of a randomised two equal-group comparative External-Pilot trial. Source: Based on Whitehead, Julious, Cooper, et al. (2015).

19.3.5 Stepped rule-of-thumb
If the planning team know something of the magnitude of the standardised effect size, ΔPlan, intended for the Main trial then they can use this knowledge in determining the planned External-Pilot sample size. Thus, Figure 19.2 suggests a range of total sample sizes for an External-Pilot, based on whether (at this preliminary stage) the initial Δ for the Main trial is thought likely to be in one of the four regions: 0,

(19.3)


where

$$\theta = \mathrm{TINV}\left(1-\beta,\ df_{\mathrm{Pilot}},\ t_{1-\alpha/2,\,N_{\mathrm{Main}}-2}\right) \tag{19.4}$$

represents the inverse function of the noncentral Student’s t-distribution. The usual, or central, t-distribution describes how a test statistic t is distributed when the null hypothesis is true; the noncentral distribution describes how t is distributed when the null hypothesis is false. The expression (19.4) takes account of dfPilot, while $t_{1-\alpha/2,\,N_{\mathrm{Main}}-2}$ takes values from Table T4 of Student’s t-distribution. However, to calculate the sample size, there is a difficulty in that NMain on the left of (19.3) is also a component of θ, on the right side of this expression. Thus, to find a solution for the sample size, NMain, an iterative process is required. In practice, a computer program is required to implement this iteration. Table T7, an extension of Julious and Owen (2006, Table 1), gives some examples of Main sample sizes (assuming randomised patient numbers are equal in the two groups) arising from the use of Equations (19.3) and (19.4) for a range of standardised effect sizes and the associated degrees of freedom for estimating sPilot.

Example 19.8 Non-central t-distribution – major depressive disorder
The External-Pilot of Example 19.4 suggested sTAU = 5.7 based on a relatively small sample size and so is determined with some degree of uncertainty. Suppose this had been based on very few subjects with dfPilot = 10. If a planned Main trial is to detect a minimum clinically important difference of δPlan = 2.8, then the standardised effect size is 2.8/5.7 or about 0.5. Then, with φ = 1, 2-sided 5% significance level, 90% power, and df = 10, Table T7 suggests NMain = 222. Had this been based on dfPilot = 20, then NMain = 194 – a reduction in size of 28 subjects. However, once df > 50 there is little change in the values of NMain as df further increases.
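The iteration described for the noncentral t approach can be sketched in standard-library Python. Several things here are assumptions rather than the book's own code. The sample size relation is taken to be the Julious and Owen form for two equal groups, N = 4θ²/Δ², with θ as in Equation (19.4); this is assumed because Equation (19.3) is not legible in this copy. The noncentral t quantile is obtained by simple numerical integration and bisection. All function names (`nct_ppf`, `n_main_nct` and so on) are hypothetical. Under those assumptions the results should land close to the Table T7 values of 222 and 194 quoted in Example 19.8, although exact agreement is not guaranteed.

```python
import math
from statistics import NormalDist

def _u_grid(df, steps=1000):
    """Midpoint nodes and probability weights for U = sqrt(chi-square_df / df)."""
    half_width = 8.0 / math.sqrt(2.0 * df)  # roughly 8 SDs of U either side of 1
    lo, hi = max(0.0, 1.0 - half_width), 1.0 + half_width
    h = (hi - lo) / steps
    log_norm = (df / 2.0) * math.log(2.0) + math.lgamma(df / 2.0)
    nodes, weights = [], []
    for i in range(steps):
        u = lo + (i + 0.5) * h
        v = df * u * u  # chi-square value corresponding to this node
        log_pdf = (df / 2.0 - 1.0) * math.log(v) - v / 2.0 - log_norm
        nodes.append(u)
        weights.append(math.exp(log_pdf) * 2.0 * df * u * h)
    total = sum(weights)
    return nodes, [w / total for w in weights]

def nct_cdf(t, ncp, grid):
    """P(T <= t) for T = (Z + ncp) / U, integrating the Normal CDF over U."""
    nodes, weights = grid
    return sum(w * 0.5 * math.erfc((ncp - t * u) / math.sqrt(2.0))
               for u, w in zip(nodes, weights))

def nct_ppf(q, df, ncp=0.0):
    """Quantile of the noncentral t-distribution by bisection (ncp=0 is central t)."""
    grid = _u_grid(df)
    lo, hi = -1.0, 1.0
    while nct_cdf(lo, ncp, grid) > q:
        lo *= 2.0
    while nct_cdf(hi, ncp, grid) < q:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, ncp, grid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def n_main_nct(delta_std, df_pilot, alpha=0.05, power=0.90):
    """Total Main trial size, two equal groups, ASSUMING N = 4 * theta^2 / Delta^2."""
    z = NormalDist().inv_cdf
    # Start from the large-sample (Normal) value, then iterate because
    # the critical t-value inside theta itself depends on N.
    n_group = math.ceil(2.0 * (z(1 - alpha / 2) + z(power)) ** 2 / delta_std ** 2)
    for _ in range(20):
        t_crit = nct_ppf(1 - alpha / 2, 2 * n_group - 2)
        theta = nct_ppf(power, df_pilot, t_crit)
        updated = math.ceil(2.0 * theta ** 2 / delta_std ** 2)
        if updated == n_group:
            break
        n_group = updated
    return 2 * n_group

# Setting of Example 19.8: standardised effect size 0.5, 2-sided 5%, 90% power.
sizes = {df: n_main_nct(0.5, df) for df in (10, 20)}
print(sizes)
```

Where SciPy is available, `scipy.stats.nct.ppf` and `scipy.stats.t.ppf` could replace the hand-rolled quantile function; the standard-library version is used here only to keep the sketch self-contained.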

19.4 Considerations across external-pilot and main trial

If one of the adjustment methods described above is used as a basis for using the estimate of the SD provided by the External-Pilot then, as we have illustrated, the size of that pilot ultimately affects the sample size of the Main trial. That is, since the adjustment methods depend on the df with which sPilot is estimated, so ultimately does the size of the Main trial. Thus, there is a trade-off between having a small External-Pilot (and hence large κ) with the likelihood of a larger Main trial, or a larger External-Pilot (smaller κ) with a likely smaller Main trial.

19.4.1 Optimising overall sample size
Here we think of the External-Pilot and the Main trial as comprising one Overall trial programme. This leads to the idea of minimising the sum of the sample sizes from the External-Pilot and the Main trial together to produce an optimal ‘Overall’ study sample size. Once the required Main trial size, NMain, assuming equal numbers per group, has been determined, the Overall study sample size then becomes

$$N_{\mathrm{Overall}} = N_{\mathrm{Pilot}} + N_{\mathrm{Main}} \tag{19.5}$$

Table T8 gives optimal sample sizes of the External-Pilot, NPilot, the Main trial, NMain and their collective total, NOverall, for the NCT approach for the calculations.

Example 19.9 External-pilot and main trial
A two-arm parallel group (Main) RCT is being planned with a 2-sided α = 0.05 and a power, 1 − β = 0.9. The primary outcome is a continuous variable with an approximate Normal distribution. As the investigators are unsure about design aspects of the Main trial such as what the SD of the outcome measure is anticipated to be, the likely recruitment and dropout rates, they decide to run an External-Pilot trial. Suppose the standardised effect size to be used in the Main trial, which is the minimum regarded as clinically worthwhile, is of ‘medium’ size at ΔPlan = 0.4. Then, for a Main trial power of 90%, the stepped rule-of-thumb sample size option for a comparative External-Pilot suggested by Figure 19.2 is NPilot = 30 patients. Then, if the optimal approach of Table T8 is chosen, the size of the External-Pilot remains at NPilot = 30, the associated Main trial size NMain = 290, and so an Overall sample size, External-Pilot trial plus Main trial, of NOverall = 320. The corresponding values for ΔPlan = 0.3 are as follows: NPilot = 38, NMain = 504 with NOverall = 542. Whereas for ΔPlan = 0.7 they are: NPilot = 20, NMain = 100 with NOverall = 120.

19.5 Internal-pilot studies

As we have pointed out, an External-Pilot study is conducted before the Main trial becomes open for recruitment and the data obtained from the Pilot do not contribute to an efficacy evaluation of the interventions concerned. However, when the design of the Main trial is settled upon, and the trial started, then early participants could form part of an Internal-Pilot. In which case, the assumption is that at some (early) stage during the course of recruitment to the Main trial a specific review is conducted as to whether or not the agreed trial size is sufficient. The Internal-Pilot provides an insurance against misjudgement regarding the baseline planning assumptions which may have compromised the purpose intended. It is, nevertheless, important that the intention to conduct an Internal-Pilot is recorded at the outset with full details provided in the trial protocol and the eventual published report.


19.5.1 Continuous outcomes
In the case of a continuous outcome measure, an Internal-Pilot reviews the initial sample size, NInitial, of the Main trial which had been estimated using σInitial as the anticipated SD. Thus, the Internal-Pilot takes the endpoint data recorded from first nInternal (

[Figure 20.5: flow diagram. Labels recoverable from the original: Randomise; Control; Test 3; ‘Initially PFS’; ‘PFS and OS’; ‘Randomisation commences in Stage II’.]

Figure 20.5 A rolling randomised trial commencing with a Standard and two test arms in the initial stage, one test dropping after Stage I accrual and a new test arm replacing this in Stage II.

larger trial situations. Thus Parmar, Barthel, Sydes, et al. (2008) have explored the possibility of designs for randomised trials with several arms in which, as information accumulates, a rolling process of dropping one or more of these and (possibly) replacing them by new options is initiated. An example of this type of design is given in Figure 20.5 in which patients in Stage I are randomised equally to three options, one of which is regarded as the standard or control. Sometime later, a comparison between these arms is made using an intermediate endpoint, in this case progression-free survival (PFS), rather than using the endpoint of primary concern which is overall survival (OS). At this point, Test 2 appears to be doing better than Standard with respect to PFS. In contrast, Test 1 appears to be doing no better than Standard and is consequently dropped from the design. This allows a new option (Test 3) to be introduced. Stage II then continues with equal randomisation between Standard, Test 2 and Test 3.

Example 20.5 Prostate cancer
James, Sydes, Clarke, et al. (2008) describe an ongoing trial using ideas encapsulated in Figure 20.5 but involving six options for systemic therapy for patients with advancing or metastatic prostate cancer. The options are: (i) Hormone therapy (HT) alone, (ii) HT + Zoledronic acid (Zo), (iii) HT + Docetaxel (Do), (iv) HT + Celecoxib (Ce), (v) HT + Zo + Do and (vi) HT + Zo + Ce. They point out that the design, approval process, launch and recruitment are all major challenges. The report of the subsequent trial of James, Sydes, Clarke, et al. (2016) is confined to only four of the six options, namely (i), (ii), (iii) and (v). An important feature of the eventual design was that randomisation favoured HT alone (the Standard arm) in a 2 : 1 : 1 : 1 ratio as the results of each of the other arms were to be contrasted with this group. The authors state that: ‘These are the first mature, comparative, randomised data to emerge from the trial.’ They conclude that ‘Zoledronic acid … should not be part of the standard of care for this population and Docetaxel treatment should become part of standard care ….’

Korn and Freidlin (2017) recognise that there is a wide range of adaptive elements that can be incorporated into clinical trial design, discuss advantages and disadvantages of such features and highlight issues arising from actual trials that have been completed.

20.3 Large simple trials

In some situations, investigators may be concerned with questions that have considerable public health impact even if the advantage demonstrated in one intervention over the other is numerically small. This is particularly relevant in the fields of cardiovascular disease and the more common types of cancer, where even a small increase in cure or survival rates will bring major benefits to many patients. As we have seen in terms of trial size, the smaller the potential benefit and hence the effect size, the larger the trial must be in order to be reasonably confident that the benefit envisaged really exists at all. To be specific and assuming a continuous outcome, with a 1 : 1 allocation ratio, 2-sided α = 0.05 and 1 − β = 0.9, then a decrease of ΔCohen in Equation (9.5) from 0.2 to 0.1 increases the required trial size from approximately 1000 to more than 4000. One very extreme example of ‘a large trial’ is that of Chen, Pan, Chen, et al. (2005) which included 45 852 patients with acute myocardial infarction who were randomised to receive, in addition to standard interventions, either metoprolol or matching placebo.

Trials involving many thousands of patients to estimate a small benefit reliably are a major undertaking. To be justified they must be practical. Hence, they have to be in common diseases or conditions in order for the required numbers of participants to be available in a reasonable time frame. Such trials must be testing a treatment or intervention that not only has wide applicability but can also be easily administered by the clinical teams responsible or, even better, by the subjects themselves. In most instances, any treatment under test must be readily available across a wide range of health care systems and this tends to imply they need to be of low cost. This is especially the case if the trial demonstrates a clinically useful benefit as then putting these results into actual practice will also have future cost implications. The treatments must be relatively non-toxic, otherwise the small benefit might be outweighed by the severity of adverse side effects.

There will be few circumstances where such trials cannot involve multicentre recruitment, possibly on an international scale. This means that the design team will need to consult a wide range of collaborating teams, and obtaining consensus on the final design may not be easy. The responsible trial office will need to be prepared for the organisational consequences. These factors all suggest that these trials should be ‘simple’ trials – implying minimal imposition on the recruiting centres with respect to the data they record on each subject (albeit on many individuals) which in turn reduces the trial office work to the minimum required whilst retaining sufficient information to answer the questions posed.
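The effect of halving the standardised effect size on trial size can be illustrated with the familiar large-sample Normal approximation. This sketch assumes the form N = ((1 + φ)²/φ)(z₁₋α/₂ + z₁₋β)²/Δ², which may differ from Equation (9.5) by a small correction term; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def total_size_normal_approx(delta_std, alpha=0.05, power=0.90, phi=1.0):
    """Total two-group trial size from the large-sample Normal approximation.

    phi is the allocation ratio (phi = 1 means equal groups).
    """
    z = NormalDist().inv_cdf
    n = ((1 + phi) ** 2 / phi) * (z(1 - alpha / 2) + z(power)) ** 2 / delta_std ** 2
    return math.ceil(n)

n_for_02 = total_size_normal_approx(0.2)
n_for_01 = total_size_normal_approx(0.1)
print(n_for_02, n_for_01, round(n_for_01 / n_for_02, 2))
```

Because the size depends on 1/Δ², halving the effect size multiplies the required number of participants by about four, which is the jump from roughly 1000 to more than 4000 described above.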

Example 20.6 Chronic heart failure
In the CHARM-Added trial of McMurray, Östergren, Swedberg, et al. (2003), patients with chronic heart failure (CHF) who were being treated with ACE inhibitors were randomised to either placebo (P) or candesartan (C). The primary outcome was a composite event of the first of (i) unplanned admission to hospital for the management of worsening CHF or (ii) time to cardiovascular death. The authors state: ‘The planned sample size of 2300 patients was designed to provide around 80% power to detect a 16% relative reduction in the primary outcome, assuming an annual placebo event rate of 18%.’

In fact, 2548 patients were enrolled, 1272 received P and 1276 C. However, the report of the trial, which recruited patients from Canada, Sweden, USA and the UK, was based on 538 and 483 events within the P and C groups respectively – far fewer than the number of patients randomised. The above disparity is typical of trials with time-to-event endpoints, as the number to recruit derived from Equation (9.13) is effectively the number needed to recruit ‘in order to observe the required number of events’ as calculated from Equation (9.12). The results of the trial are summarised in Figure 20.6 and this shows a lower event rate in those patients receiving C with HR = 0.85, 95% CI 0.75–0.96, p-value = 0.011. This remains essentially unchanged after adjusting for prognostic factors such as heart disease risk factors, medical history and medical treatment prior to randomisation, which gives HR = 0.85 with p-value = 0.010. The observed annual event rates with C and P were 14.1 and 16.6% respectively, a difference of 2.5%. Despite this numerically small difference the authors concluded that: ‘The addition of candesartan … leads to a further clinically important reduction in relevant cardiovascular events ….’
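A rough check on these figures is possible with two standard approximations. Neither is taken from the trial report or confirmed as the book's Equation (9.12): a Schoenfeld-type formula for the number of events required, and SE(log HR) ≈ √(1/e₁ + 1/e₂) for the confidence interval. Treating the planned 16% relative reduction as HR = 0.84 with a 2-sided 5% test is also an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf

def events_required(hr_plan, alpha=0.05, power=0.80, phi=1.0):
    """Schoenfeld-type approximation to the total number of events needed."""
    return ((1 + phi) ** 2 / phi) * (z(1 - alpha / 2) + z(power)) ** 2 / math.log(hr_plan) ** 2

def hr_confidence_interval(hr, events_a, events_b, level=0.95):
    """Approximate CI for a hazard ratio using SE(log HR) ~ sqrt(1/e_a + 1/e_b)."""
    se = math.sqrt(1 / events_a + 1 / events_b)
    half = z(0.5 + level / 2) * se
    return math.exp(math.log(hr) - half), math.exp(math.log(hr) + half)

planned_events = events_required(0.84)  # assumed reading of '16% relative reduction'
observed_events = 538 + 483             # events reported in the P and C groups
low, high = hr_confidence_interval(0.85, 538, 483)
print(round(planned_events), observed_events)
print(f"95% CI {low:.2f} to {high:.2f}")
```

Under these assumptions the events required are of the same order as the 1021 observed, and the approximate interval is close to the reported 0.75–0.96, which illustrates that precision is driven by the number of events rather than by the number of patients randomised.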

Needless to say, such large (although simple) trials are a considerable undertaking and, if nothing else, require a very experienced team with substantial resources.

[Figure 20.6: Kaplan–Meier plot. Vertical axis: Proportion with cardiovascular death or hospital admission for CHF (%), 0 to 50. Horizontal axis: Time (years), 0 to 3.5. Curves: Placebo and Candesartan. Hazard ratio 0.85 (95% CI 0.75–0.96), p = 0.011; adjusted hazard ratio 0.85, p = 0.010. Number at risk, Candesartan: 1276, 1176, 1063, 948, 457; Placebo: 1272, 1136, 1013, 906, 422.]

Figure 20.6 Cumulative Kaplan–Meier curves for the event, cardiovascular death or hospital admission, from the CHARM-Added trial. Source: McMurray, Östergren, Swedberg, et al. (2003). © 2003 Elsevier

20.4 Bayesian methods

The essence of Bayesian methodology in the context of randomised clinical trials is to incorporate relevant external information concerning the question under consideration into the planning, monitoring, analysis and interpretation processes. The object of the approach is not solely to improve the trial design but also to facilitate decisions with respect to whether or not the trial conclusions should be adopted into routine clinical practice. This all coincides with the opinions of Clarke, Hopewell and Chalmers (2007) who state that clinical trials should begin and end with up-to-date systematic reviews of other relevant evidence. From a Bayesian perspective this can range from summarising the evidence at the planning stage of the trial, which may then impact on the final design chosen, to summarising accumulating external evidence as the trial is recruiting and until such time as it closes for analysis, and then putting all of this alongside the trial evidence to assist in the interpretation.

20.4.1 Mechanics
In broad terms, the usual (or frequentist) approach to the estimation of a parameter such as βTreat of Equation (2.1) is to regard the parameter as a fixed value for the population concerned. Then, once the data are collected from the trial, this is estimated by bData. The corresponding standard error (SEData) gives a measure of the precision with which βTreat is estimated entirely from the internal evidence of the trial data itself. The distribution of this estimate is assumed to have a Normal form and this is termed the likelihood distribution.

The Bayesian approach seeks to review the relevant external evidence pertinent to the trial question (this may range from very little to quite extensive information) and to use it to estimate βTreat also. The information is summarised by bExternal, which is also assumed to have a Normal distribution form with standard error SEExternal. This is termed the prior distribution.

Since both the Internal and External information are concerned with the same question, it seems natural to combine the two. This could be done by taking a simple average of bData and bExternal but this ignores the relative precision of the two estimates. For example, if bData is based on data from a very large randomised trial whilst the external information is very sparse, it would seem foolish to ignore this fact when calculating their average. The combination of an Internal (objective) summary with an External (possibly subjective) summary is termed a Bayesian synthesis. Thus, one means of taking the relative precision of the Internal and External evidence into account is to estimate βTreat by the weighted mean

$$b_{\mathrm{Bayes}} = \frac{W_{\mathrm{Data}}\, b_{\mathrm{Data}} + W_{\mathrm{External}}\, b_{\mathrm{External}}}{W_{\mathrm{Data}} + W_{\mathrm{External}}}, \tag{20.4}$$

where $W_{\mathrm{Data}} = 1/SE_{\mathrm{Data}}^{2}$ and $W_{\mathrm{External}} = 1/SE_{\mathrm{External}}^{2}$. As both bData and bExternal are assumed to have a Normal distribution form, so also will bBayes but with standard error

$$SE(b_{\mathrm{Bayes}}) = \frac{1}{\sqrt{W_{\mathrm{Data}} + W_{\mathrm{External}}}} \tag{20.5}$$

Example 20.7 Prior, likelihood and posterior distributions
Tan, Wee, Wong and Machin (2008, Table 5) describe an example of the prior, likelihood and posterior distributions in situations similar to Figure 20.7 which might arise if two treatments were being compared with a time-to-event endpoint. In this case, one is interested in estimating the hazard ratio (HR) and log HR is assumed to have a Normal distribution. No difference between treatments corresponds to HR = 1 (or equivalently log HR = 0), whilst an advantage to (say) the Test (T) over the Standard (S) treatment will result in HR < 1 or log HR < 0. For the likelihood, the SEData(log HR) depends critically on the number of deaths observed within the trial. Similarly, for the prior distribution, SEPrior(log HR) also depends on the number of ‘weighted deaths’ in the external information.

In the illustration of Figure 20.7, the prior distribution has a mean HR = 0.42 which is quite different from the null hypothesis value of HR = 1 (log HR = 0) and therefore suggests that the external information points to a substantial benefit to T. Also, the prior has a relatively sharp peak indicating a small SE and hence a relatively large number of ‘weighted deaths’ associated with it. However, the trial once complete, and the data summarised, gives a likelihood distribution centred at HR = 0.57 which is closer to HR = 1 but with a very small area beyond to the right. This too indicates a benefit to T but not as large as the prior distribution had suggested. Also, the likelihood is not so peaked indicating a relatively large SE and hence fewer ‘real’ deaths than ‘weighted deaths’. Finally, the posterior distribution, formed from the combination of the prior and likelihood distributions, takes a central position at HR = 0.47 and is more peaked in shape.
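The precision-weighted combination of prior and likelihood can be sketched as follows. The prior and likelihood means are those quoted in Example 20.7, but the two standard errors (0.20 and 0.26) are hypothetical values chosen only for illustration, since the example does not state them; the function name is likewise an assumption.

```python
import math
from statistics import NormalDist

def bayes_synthesis(b_data, se_data, b_external, se_external):
    """Precision-weighted mean of the data and external estimates, with its SE."""
    w_data, w_external = 1 / se_data ** 2, 1 / se_external ** 2
    b_bayes = (w_data * b_data + w_external * b_external) / (w_data + w_external)
    se_bayes = 1 / math.sqrt(w_data + w_external)
    return b_bayes, se_bayes

# Work on the log HR scale. Means from Example 20.7; both SEs are HYPOTHETICAL.
b_post, se_post = bayes_synthesis(math.log(0.57), 0.26, math.log(0.42), 0.20)
hr_post = math.exp(b_post)
# Posterior probability that log HR < 0, that is HR < 1 (benefit to Test).
prob_benefit = NormalDist(b_post, se_post).cdf(0.0)
print(f"posterior HR = {hr_post:.2f}, SE(log HR) = {se_post:.3f}")
print(f"P(HR < 1) = {prob_benefit:.4f}")
```

With these illustrative SEs the posterior mean falls between the prior and the likelihood, nearer to the more precise prior, and the posterior SE is smaller than either input SE, which corresponds to the more peaked posterior described in the example.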

[Figure 20.7: density curves labelled Prior, Likelihood and Posterior plotted against HR, on a horizontal axis running from 0.2 (Favours Test) to 1.5 (Favours Standard), with the posterior centred at HR = 0.47.]

Figure 20.7 Illustration of how an external prior and the likelihood distribution from the trial data are combined into a posterior distribution. Source: Based on Tan, Wee, Wong, et al. (2008).

The distribution obtained from the combination of Internal data (the likelihood) and the External information (the prior) is named the posterior distribution of the parameter. This also has a Normal distribution, from which a probability statement can then be derived concerning the true value of the parameter, βTreat.

In the above explanation, we have used the terms ‘real’ and ‘weighted’ deaths. The ‘real’ corresponds to the actual numbers of deaths that are observed during the course of the randomised clinical trial. In summarising the external evidence, if this were to include a randomised trial of the exact same design the investigators are planning, then the reported deaths, say D, within that trial would be ‘real’. However, what is more likely is that the external information will be in some sense tangential to the exact trial question posed. For example, suppose rather than a randomised comparison, two single-arm studies had been conducted using exactly the same arms as those proposed in the trial in planning and between them the number of deaths reported was also D. In this case, the evidence is not so reliable, so the external summary down-weights D to a reduced value d. Thus, the total (internal plus external) number of ‘deaths’ would be regarded as D + d rather than 2D.

20.4.2 Constructing the priors
The information available, and pertinent to the trial in question, has to be summarised into a prior distribution. An integral part of the process of constructing the priors is to perform a thorough and ongoing literature search using the standard approaches adopted by any systematic overview. Such a search may reveal a whole range of studies including, for example, randomised trials using the same regimens as those proposed for the new trial but in a different patient group, non-randomised comparative studies, single-arm studies of one or other of the intended arms and case series. Tan, Dear, Bruzzi, and Machin (2003) have suggested how each of such disparate studies may be judged within the context of the intended trial and then appropriately weighted before being merged into a single prior distribution such as that of Figure 20.7. The mean of such a distribution would correspond to the planning effect size, which can then be assessed by the design team as clinically worthwhile in their context and hence, if regarded as reasonable, used as a basis for sample size estimation purposes.

Example 20.8 Nasopharyngeal cancer

At the time the trial SQNP01 was designed, comparing chemo-radiation (CRT) and radiotherapy alone (RT) in patients with nasopharyngeal cancer and reported by Wee, Tan, Tai, et al. (2005), nine publications relevant to the trial design were available in the literature, although these were not amalgamated into a prior distribution to obtain a planning effect size. However, once the SQNP01 trial was closed and reported, a case study was undertaken using the suggested approach of Tan, Dear, Bruzzi, and Machin (2003). Omitting details, their synthesis resulted in a prior distribution with mean log HR = −0.93 (HR = 0.39). Thus, had this prior been constructed at the planning stage of SQNP01, it would have clearly indicated the possibility of a considerable advantage of CRT over RT.

In contrast with the situation of Example 20.8, in other cases there may be little external evidence available. In such circumstances, one approach is to elicit clinical opinion about the likely benefit (of Test over Standard) from a wide range of individuals knowledgeable about both the disease in question and the alternative approaches to therapy. This information can then be collated and an ‘elicited prior distribution’ obtained.


Example 20.9 Hepatocellular carcinoma

Although information was available from one randomised trial, which had perhaps been prematurely closed, Tan, Chung, Tai, et al. (2003) sought the opinions of 14 different investigators experienced in the treatment of hepatocellular carcinoma to develop an elicited prior distribution. Following the methodology suggested by Spiegelhalter, Freedman and Parmar (1994), this resulted in a prior with log HR = −0.47 (HR = 0.6) and SE = 0.43 derived from their opinions with respect to the advantage in 2-year recurrence-free survival of iodine-131-lipiodol as adjuvant treatment over surgery alone. This information was then used to inform the planning of a confirmatory trial that was then conducted and reported by Chung, Ooi, Machin, et al. (2013). Subsequently Gandhi, Tan, Chung and Machin (2015) have combined the results from that trial with external evidence and conclude from their synthesis that there remains some scepticism concerning the value of using iodine-131-lipiodol in this context.
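The way a Normal prior of this kind might later be combined with trial data can be sketched as below. This is a minimal illustration only: it assumes that both the prior and the likelihood are Normal on the log HR scale and that each is weighted by the inverse of its variance. The prior values are those quoted in Example 20.9, whereas the trial result (`data_mean`, `data_se`) is invented for illustration and the function name `combine_normal` is our own.

```python
from math import exp, sqrt
from statistics import NormalDist


def combine_normal(prior_mean, prior_se, data_mean, data_se):
    """Precision-weighted combination of a Normal prior and a Normal likelihood."""
    w_prior = 1.0 / prior_se ** 2  # weight = inverse variance (assumed)
    w_data = 1.0 / data_se ** 2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_se = sqrt(1.0 / (w_prior + w_data))
    return post_mean, post_se


# Elicited prior of Example 20.9, on the log hazard ratio scale.
prior_mean, prior_se = -0.47, 0.43
# Hypothetical trial result, invented purely for illustration.
data_mean, data_se = -0.10, 0.20

post_mean, post_se = combine_normal(prior_mean, prior_se, data_mean, data_se)
# Posterior probability that the true log HR is below 0 (Test better than Standard).
prob_benefit = NormalDist(post_mean, post_se).cdf(0.0)

print(f"posterior log HR = {post_mean:.3f} (HR = {exp(post_mean):.2f}), SE = {post_se:.3f}")
print(f"P(log HR < 0 | prior, data) = {prob_benefit:.3f}")
```

Because the hypothetical data are more precise than the prior, the posterior mean lies closer to the data estimate than to the prior mean, and the posterior SE is smaller than either input SE.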

20.5 Interim analyses

20.5.1 Alpha-spending function

The application of the interim analysis stopping rules of O’Brien and Fleming (1979) and Pocock (1983), described in Figure 10.3, has some practical limitations. These include having to prespecify the number of planned interim looks as well as the timing of each interim analysis. They also require approximately equal numbers of subjects or (as applicable) events between successive analyses. Lan and DeMets (1983) proposed a more flexible approach based on a so-called α-spending function, α*(t), to allow the DSMB to change the timing and frequency of the interim analyses. In essence, for a given test size α (say, 0.05) used for determining the size of the trial, any interim analysis conducted prior to the final analysis uses a test size, α*(t), which varies according to the proportion, t, of the total anticipated information that has accrued at the time when the interim data are available for analysis. An important feature of these functions is that when the final analysis is conducted, at t = 1, the level of significance used is that defined at the planning stage of the trial design. The simplest model Lan and DeMets (1983) describe is one in which the α-spending function is a proportion of α, that is,

$$\alpha^{*}_{1}(t) = t\alpha \tag{20.6}$$

In this case, if the proportion of the total information accrued is t = 0.25, then for α = 0.05, $\alpha^{*}_{1}(0.25) = 0.25 \times 0.05 = 0.0125$. Thus, the interim analysis at that stage would use this as the threshold for decision making.


Two other choices for α*(t) are stipulated by Lan and DeMets (1983). The first is

$$\alpha^{*}_{2}(t) = 2\left[1 - \Phi\left(\frac{z_{1-\alpha/2}}{\sqrt{t}}\right)\right] \tag{20.7}$$

In this equation, Φ(.) denotes the cumulative probability function of the standard Normal distribution of Equation (8.2). In this case, if the proportion of the total information accrued is t = 0.25, then from Table T3, z0.975 = 1.96 and $\Phi(1.96/\sqrt{0.25}) = \Phi(3.92) = 0.99996$ from Table T2. Hence $\alpha^{*}_{2}(0.25) = 2 \times [1 - 0.99996] = 0.00008$, which is close to the O’Brien–Fleming (1979) stopping boundary at the first of four (three interim and one final) planned analyses as given in Figure 10.3. The other rule is

$$\alpha^{*}_{3}(t) = \alpha \log\left[1 + (e - 1)t\right], \tag{20.8}$$

where the exponential constant e = 2.71828 …. In this case, if the proportion of the total information accrued is t = 0.25, then $\alpha^{*}_{3}(0.25) = 0.05 \times \log[1 + 0.25 \times (e - 1)] = 0.0179$, which is very close to the Pocock (1983) stopping boundary of 0.018 for four analyses as given in Figure 10.3.
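The three spending functions can be evaluated directly, as in the following sketch. It takes Equation (20.6) as tα, Equation (20.7) as 2[1 − Φ(z1−α/2/√t)] and Equation (20.8) as α log[1 + (e − 1)t], with log the natural logarithm. The function names `alpha_1`, `alpha_2` and `alpha_3` are our own labels, and the exact Normal quantile is used rather than the tabulated 1.96, so the values may differ slightly from those obtained with Tables T2 and T3.

```python
from math import e, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution, mean 0 and SD 1


def alpha_1(t, alpha=0.05):
    """Equation (20.6): alpha spent in proportion to the information accrued."""
    return t * alpha


def alpha_2(t, alpha=0.05):
    """Equation (20.7): spends very little alpha early (O'Brien-Fleming-like)."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - STD_NORMAL.cdf(z / sqrt(t)))


def alpha_3(t, alpha=0.05):
    """Equation (20.8): spends alpha more evenly (Pocock-like)."""
    return alpha * log(1 + (e - 1) * t)


for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: {alpha_1(t):.4f}  {alpha_2(t):.5f}  {alpha_3(t):.4f}")
```

At t = 1 all three functions return the planning value α = 0.05, which is the property noted in the text for the final analysis.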

[Figure 20.8: plot of cumulative Type I error (vertical axis, 0 to α = 0.05) against the proportion of total information accrued, t (horizontal axis, 0 to 1), showing the three curves α*1(t) = tα, α*2(t) = 2[1 − Φ(z1−α/2/√t)] and α*3(t) = α log[1 + (e − 1)t].]

Figure 20.8 Three alpha-spending functions for interim data monitoring described by Lan and DeMets (1983, pp. 660–661). Source: Based on Lan and DeMets (1983).


As is clear from Figure 20.8, the three options described will lead (apart from the final analysis) to different stopping rules at any given proportion of the total information anticipated, t. For example, use of Equation (20.8) would lead to recommendations to close trials earlier than would Equation (20.7). Thus, as indicated by Lan and DeMets (1983, p. 661), ‘Clearly, choice of α∗(t) should be made before the data are monitored’. Hence, one aspect of the work of a DSMB is to select the most appropriate stopping rule (if any) to be used amongst these and those of Figure 10.3.

20.5.2 Updated prior

The prior obtained for planning purposes, or the initial planning prior updated from new external evidence accumulated during the course of a trial, may also be used in monitoring trial progress. Thus, for example, as well as providing an independent DSMB (as described in Chapter 10) with evidence from the accumulating data in the trial, this external information may assist them when making their recommendation with respect to the future course of the trial under review. This might be to recommend an increase in the trial size over the original plan, or early closure of the trial if the current data and the (updated) prior, each separately and when combined, indicate either futility or strong evidence of benefit to patients receiving one intervention arm.

20.5.3 Predictive probability and predictive power tests

We introduced in Section 10.4, when discussing monitoring safety and data, the concept of stopping an RCT for efficacy and for futility reasons, the latter by means of the conditional power test, CP(t), of Equation (10.1). However, another possibility is to use a Bayesian approach to futility monitoring by means of the predictive probability test, PP(t). This contrasts with the frequentist view of probability and estimates the Bayesian posterior probability of observing a clinically important treatment effect. The trial is declared positive if the posterior probability of a clinically meaningful difference, at the stage of the trial when futility is being considered, is greater than a prespecified confidence level, η. Thus,

$$PP(t) = P\left[p_T(t) - p_S(t) > \delta \mid \text{observed data}\right] > \eta, \tag{20.9}$$

where 0 < η < 1 with a value of η = 0.2 suggested. Here pT(t) and pS(t) denote the proportion of observed events in the two treatment groups at the time of the interim analysis, t, whilst δ is a clinically important treatment difference of concern, often set as δPlan when trial size is being determined. In addition to knowing the number of events in each group at the interim analysis, the number of future events that occur between the specific interim and final analyses may be predicted. These are based on the interim data as well as the prior distributions of pT(t) and pS(t), which are assumed to follow a beta-binomial distribution. The beta-binomial distribution is essentially a binomial distribution with parameter π, but with π following the shape of a beta distribution rather than having a fixed value.

Another option for testing futility is the predictive power test. Briefly, this may be regarded as a mixed Bayesian-frequentist approach, as it averages the (frequency based) conditional power, CP(t), discussed in Section 10.4. The prior distribution is formulated for a series of alternative treatment effects, δ, and subsequently updated to give the posterior distributions of the treatment effects chosen given the observed interim data. Full details of the formulation and computational procedures of the predictive probability and the predictive power tests can be found in Dmitrienko, Molenberghs, Chuang-Stein and Offen (2005), who give macros for their implementation for both continuous and binary outcomes.
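The probability statement inside Equation (20.9) can be illustrated with a deliberately simplified simulation. The sketch below is not the full predictive calculation of Dmitrienko, Molenberghs, Chuang-Stein and Offen (2005): it places independent Beta(1, 1) priors on the two proportions (our assumption), ignores the prediction of future events, and uses invented interim counts. The function name `posterior_prob_exceeds` and all data values are hypothetical.

```python
import random


def posterior_prob_exceeds(x_t, n_t, x_s, n_s, delta, draws=20000, seed=2021):
    """Estimate P(p_T - p_S > delta | data) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    count = 0
    for _ in range(draws):
        # Beta posterior for each proportion: Beta(1 + successes, 1 + failures)
        p_t = rng.betavariate(1 + x_t, 1 + n_t - x_t)
        p_s = rng.betavariate(1 + x_s, 1 + n_s - x_s)
        if p_t - p_s > delta:
            count += 1
    return count / draws


# Hypothetical interim data: number event-free out of those assessed.
x_t, n_t = 60, 100  # Test
x_s, n_s = 50, 100  # Standard
eta = 0.2           # threshold value suggested in the text

results = {}
for delta in (0.0, 0.05, 0.10):
    results[delta] = posterior_prob_exceeds(x_t, n_t, x_s, n_s, delta)
    print(f"delta = {delta:.2f}: probability = {results[delta]:.3f}, "
          f"exceeds eta: {results[delta] > eta}")
```

As δ increases, the estimated probability can only decrease, since a larger difference is harder to demonstrate from the same data. A full predictive analysis would, in addition, simulate the events still to be observed before the final analysis.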

Example 20.10 Gastric cancer

The protocol of Kim, Chen, Tay, et al. (2017) in Example 10.2 described the EXPEL trial, the results of which are given by Yang, Ji, Han, et al. (2021). This was an open-label, multicentre RCT involving 22 centres which included patients, aged 21–80 years with Stage 3 or 4 gastric cancer undergoing curative resection (CR), who were randomised to receive either CR alone or CR with extensive intraoperative peritoneal lavage (EIPL). The primary outcome was overall survival (OS). A survival probability at 3 years of 60% was assumed for CR with a postulated benefit of δ = 0.1 (10%) by including EIPL. A 2-sided test size of 5%, power 80%, an attrition rate of 10% and a 1 : 1 randomisation were specified, and a total of 800 patients was planned in order to observe 250 deaths. As indicated in Figure 20.9, three interim analyses were conducted. However, in practice the first two assessed superiority, whilst the two futility analyses reported here for those dates were calculated retrospectively as an exercise at the time. Futility analysis was conducted only at the third interim analysis. At that stage, the trial was closed to recruitment at 797 patients, with 141 (56%) of the 250 anticipated deaths recorded, but with follow-up continuing.

| Interim analysis | CR alone: patients (nCR) | CR alone: deaths (dCR) | CR with EIPL: patients (nEIPL) | CR with EIPL: deaths (dEIPL) | Information fraction, t = (dCR + dEIPL)/250 | PP(t), δ = 0 | PP(t), δ = 0.05 | PP(t), δ = 0.10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1. As at Aug 2016 | 173 | 6 | 176 | 4 | 0.040 | 0.39 | 0.0013 | < 0.0001 |
| 2. As at Aug 2017 | 269 | 29 | 269 | 28 | 0.228 | 0.36 | 0.0008 | < 0.0001 |
| 3. As at Aug 2019 | 396 | 74 | 401 | 67 | 0.564 | 0.75 | 0.0045 | < 0.0001 |

Figure 20.9 Futility analysis assuming 3-year survival of 60% with surgery alone and 70% with the addition of EIPL in patients with gastric cancer. Source: Data from Yang, Ji, Han, et al. (2021).

The object of the futility analysis was to determine whether or not the trial results could be reported before the planned 250 deaths had been observed. With the planned improvement in OS, following CR with the inclusion of EIPL, of δPlan = 0.1 the corresponding predictive probability was PP(t) < 0.0001. Thus, even with more follow-up in order to record more deaths, achieving the trial planning objective was extremely unlikely. A smaller difference of δ = 0.05 was also very unlikely to be demonstrated as PP(t) = 0.0045. In view of these results, the investigators were advised that preparation for publication should proceed. The retrospective analyses of August 2016 and 2017 also indicated futility but, with so few events observed (compared with the 250 anticipated), these dates are too early to provide meaningful conclusions.
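The information fractions shown in Figure 20.9 follow directly from the deaths recorded at each interim analysis and the 250 deaths planned. A short sketch of that arithmetic is given below; the variable names are our own, and the counts are those tabulated in Figure 20.9.

```python
PLANNED_DEATHS = 250  # number of deaths required by the EXPEL design

# (interim analysis, deaths with CR alone, deaths with CR plus EIPL), from Figure 20.9
interim_looks = [
    ("Aug 2016", 6, 4),
    ("Aug 2017", 29, 28),
    ("Aug 2019", 74, 67),
]

fractions = {}
for label, d_cr, d_eipl in interim_looks:
    # Information fraction t = deaths observed so far / deaths planned
    fractions[label] = (d_cr + d_eipl) / PLANNED_DEATHS
    print(f"{label}: t = {fractions[label]:.3f}")
```

The three values agree with the information fraction column of Figure 20.9, the last corresponding to the 141 (56%) deaths recorded at the third interim analysis.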

The use of an interim analysis in the period after completion of recruitment was indicated by the DAMOCLES Study Group (2005, p. 713), who state that an interim analysis at this stage … might only result in early release of trial data.

20.5.3.1 Final analysis

Finally, once the trial data are complete, the continually updated Bayesian prior is combined with the likelihood distribution from the data to give the posterior distribution, from which estimates of the treatment effect can be obtained.

20.5.3.2 Very small trials

For those designing trials in rare tumour types or small disease subgroups, standard approaches to determining trial size for given anticipated effect size, test size and power will lead to a suggested value far beyond what is feasible, whatever the duration of the recruitment period and extent of multicentre collaboration. Before deciding on a strategy for very small trials we need some idea of what is meant by ‘very small’. Thus Tan, Dear, Bruzzi and Machin (2003) define this as fewer than 50 possible patients recruited in 5 years with multicentre-multinational collaboration. So, under this definition a single centre can never conduct a very small trial; it must be multicentre. This was a criticism we made in commenting on the single institution and small adaptive trial of Example 20.4 in which 34 patients with acute myeloid leukaemia with adverse karyotype were recruited in 7 months. Of course, if the anticipated effect size is large, for example an improvement in survival rates in a cancer from 40 to 80%, then, with 2-sided α = 0.05, 1 – β = 0.8, Equation (9.13) leads to a trial size of a few more than 50. In these circumstances, such a trial would be feasible. However, in most situations the true effect size is likely to be very much smaller than this, so any randomised trials would be correspondingly larger.

This difficulty has often led investigators to conduct single-arm trials of all the available patients. The argument is that all patients will be available to receive the Test option, that is, twice as many as would be the case in a randomised controlled trial, and hence would provide more information. Treatment comparisons are then made with previous experience of similar patients. However, by not conducting a randomised comparison one is left with all the difficulties of interpretation whatever the outcome, so this is not an approach we would recommend. As we have pointed out, Gerb and Köpcke (2007) have stated that adaptive designs may be appropriate in these circumstances but also suggest, without describing them, that Bayesian approaches are also possible. The suggestion that Tan, Dear, Bruzzi and Machin (2003) make is to construct a prior from pertinent information available and/or expert opinion, then to randomise equally to the (two) intervention groups as many patients as one can in an agreed time frame. The effect size and corresponding confidence interval can be estimated from the patient data that are accumulated, and this likelihood distribution information is then combined with the updated prior distribution to form the posterior distribution that is then used to help with the final interpretation. One has to recognise that conclusions drawn from this evidence, although the best possible given the circumstances, are seldom likely to be conclusive but will provide a firm basis for rational decision making.

20.5.3.3 Comment

The approach outlined is very relevant for rare diseases and conditions, whilst the formal synthesis of the accumulating external evidence component of this process is a valuable exercise in itself in all circumstances, whatever the ultimate size of the trial concerned. However, there is a need for formal standards and conventions to be established, including guidelines for the reporting of Bayesian designs for clinical trials.

20.6 Zelen randomised consent designs

In view of difficulties associated with obtaining informed patient consent to join a trial, various options have been proposed to minimise these difficulties. One suggestion is a Zelen (1992) design in which eligible patients are randomised to one of the two treatment groups before they are informed about the details of the trial. Once randomised, then those who are allocated to the standard treatment are all treated with it and no consent to take part in the trial is sought. This is the Standard (G1) arm of Figure 20.10. The ethical argument is that this is the treatment they would have received in the absence of the trial, so no permission is needed. On the other hand, those who are randomised to the experimental treatment (New, G2) are asked for their consent; if they agree they are treated with New; if they disagree, then they are treated with the Standard. This is known as Zelen’s single consent design.

[Figure 20.10: flow diagram. Eligible patients are randomised to Standard (G1) or New (G2). Standard (G1): treat with standard, then evaluate. New (G2): obtain consent; if yes, treat with new; if no, treat with standard; then evaluate. The randomised groups are then compared.]

Figure 20.10 The Zelen single randomised consent design. Source: Altman, Whitehead, Parmar, et al. (1995). © 1995 Elsevier.

An alternative to the single consent design is that those randomised to the Standard may also be asked if they are willing to undergo the treatment chosen for them. This is the Zelen double consent design of Figure 20.11. What they actually go on to receive, Standard or New, is then left to their choice. Thus, whichever intervention they are randomised to receive, they are made aware of the other option under test. However, Huibers, Bleijenberg, Beurskens, et al. (2004) propose a modification to this design in which, although patients are randomised before being approached and consent is then sought from both arms in the same way, neither arm is told of the existence of the alternative therapy. They used this design successfully in an occupational health trial in which fatigued employees absent from work were randomly assigned to receive cognitive behavioural therapy or no intervention.

Whatever type of Zelen design is chosen, the analysis must be made using the intention-to-treat principle, that is, it is based on the treatment to which patients were randomised, not the treatment they actually received if it differs from this. The chief difficulty is that these designs each involve some deception and, although carried out with the best of intent, this is difficult to square with an ethical approach. Also, most trials require additional assessments to be made on patients, even those in the Standard group, so consent is required for this and an explanation of why these are being made has to be provided. This process clearly nullifies the advantage of a Zelen approach in avoiding the need to seek consent from some of the patients.
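The distinction between an intention-to-treat summary and one based on the treatment actually received can be sketched as follows. The patient records are invented solely for illustration, the helper `event_rates` is our own, and the example assumes a simple binary outcome.

```python
from collections import defaultdict

# Hypothetical patients in a Zelen single consent trial:
# (arm randomised to, treatment actually received, 1 = event, 0 = no event)
patients = [
    ("Standard", "Standard", 1),
    ("Standard", "Standard", 0),
    ("Standard", "Standard", 1),
    ("New", "New", 0),
    ("New", "New", 0),
    ("New", "Standard", 1),  # declined consent, so treated with Standard
    ("New", "New", 1),
]


def event_rates(records, key_index):
    """Event proportion per group, grouping on the chosen field of each record."""
    totals = defaultdict(lambda: [0, 0])  # group -> [events, patients]
    for record in records:
        group = record[key_index]
        totals[group][0] += record[2]
        totals[group][1] += 1
    return {group: events / n for group, (events, n) in totals.items()}


itt = event_rates(patients, key_index=0)         # by randomised arm, as required
as_treated = event_rates(patients, key_index=1)  # by treatment received, not ITT
print("Intention-to-treat:", itt)
print("As treated:", as_treated)
```

In this invented example the patient who declined New is still counted in the New arm for the intention-to-treat comparison, which preserves the groups created by randomisation.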

[Figure 20.11: flow diagram. Eligible patients are randomised to Standard (G1) or New (G2) and consent is sought in both groups. Standard (G1): if yes, treat with standard; if no, treat with new. New (G2): if yes, treat with new; if no, treat with standard. All patients are evaluated and the randomised groups are then compared.]

Figure 20.11 The Zelen double randomised consent design. Source: Altman, Whitehead, Parmar, et al. (1995). © 1995 Elsevier.

The properties of these designs have been examined in some detail by Altman, Whitehead, Parmar, et al. (1995) from a perspective of conducting clinical trials in cancer. They concluded that: There are serious statistical arguments against the use of randomised consent designs, which should discourage their use.

However, Berkowitz (1998) argued for their use in surgical trials to repair cleft palate on the grounds of alleged difficulties of obtaining consent by conventional methods. This point of view was rebutted by Machin and Lee (2000), who counter-argued that in the whole reconstruction process, which usually takes several years, many procedures will be undertaken in the surgical management of the cleft that are regarded as standard and some will be used for which there is considerable uncertainty as to the correct approach. In these circumstances, a patient, or more realistically the parents, could be guided through the entire reconstruction process before any intervention takes place. This therefore sensitises the potential consent-giver from the very beginning that, although certain procedures are standard, there are others in which the best approach is not clear. Once at that (unclear) stage of the reconstruction process, a choice will have to be made. Therefore, a brief description of the future trial the child may enter, although distant in time for the child concerned, could be introduced and even reference made to the randomisation requirement. As care of the child proceeds stage-by-stage, more and more information about the trial can be provided, so that at the crucial point in time, the consent-giver, who may change from the proxy to the young adult concerned, will be truly informed of available options. Finally, Machin and Lee (2000) contrast this with the very difficult situation in which a newly diagnosed cancer patient, having been told of their life-threatening condition, is simultaneously asked to be randomised into a clinical trial. It was for this latter type of circumstance that the Zelen approach was formulated.


Despite the criticism of its proposed use in the context of cleft lip repair and cancer, the Zelen design has been found appropriate in other situations.

Example 20.11 Vaginal cuff infections following hysterectomy

Larsson and Carlsson (2002) conducted a randomised clinical trial to investigate the role of Metronidazole as compared to a no treatment control in lowering vaginal cuff infection after hysterectomy amongst women presenting for surgery with bacterial vaginosis. In total 213 women were randomised and the process was described as follows by the authors but with our emphasis in italics: Randomization was done according to Zelen (1979), using sealed envelopes in blocks of ten patients and carried out before informed consent. The nurse asked patients at preoperative registration if they wanted to join the study, before the patients met the operating doctor at the final examination. Thus, women randomized to the non-treatment group were not asked to join the study, as no antibiotic treatment is the normal procedure at the clinic. Women randomized to the treatment group were asked to join the study and receive treatment.

The authors also state that the Zelen single randomised-consent design trial of Figure 20.10 had approval from both the regional ethics committee and the Swedish Medical Products Agency. However, the trial results reported do not give the number of women assigned to each treatment and the final analysis only refers to 142 women, 75 of whom were assigned to receive Metronidazole rectally and 67 to no treatment.

Example 20.12 Patients in acute psychiatric crisis

Cornelis, Barakat, Dekker, et al. (2018) describe a clinical protocol that intends to use the Zelen double randomised consent trial design of Figure 20.11 in patients in acute psychiatric crisis. This trial plans to compare Intensive Home Treatment (IHT) with Care-As-Usual (CAU) with the objective of: ‘a 33% reduction in hospitalisation days at 52 weeks post-treatment allocation in IHT compared to CAU’. The authors explain in some detail why, in this trial involving those in crisis and with the particular form of the interventions concerned, the Zelen design is chosen in preference to a design that obtains informed consent before randomisation.

Whether or not Zelen designs are appropriate is very context dependent and, although Larsson and Carlsson (2002) have used this approach, there are few examples in the literature for individual-patient randomised trials. However, Piaggio, Carolli, Villar, et al. (2001) argue for their use in the context of trials using the cluster designs of Chapter 16. They give an important example when comparing standard with new antenatal care; care which could only be implemented on a clinic-by-clinic basis. Clearly Zelen designs cannot be double-blind or use placebos.

20.7 Systematic overviews

Systematic reviews of the literature and other sources of information form an integral part of the design process for any proposed clinical trial. These reviews should encompass all pertinent information and not just, for example, earlier randomised trials. As we have stated, such a process is very useful if routinely applied before launching a new clinical trial as a means of confirming the need to carry out the planned clinical trial in question, during the trial for interim analysis purposes and after completion of the trial as a means of synthesising and summarising the current knowledge on the research question of interest.

In many instances several, or even many, randomised trials may have been conducted addressing the same or similar questions and although possibly none of these provides convincing evidence for a particular approach, taken together they may be firmly suggestive of a benefit. A systematic overview is the process of finding all the randomised trials that are pertinent to a particular question and extracting the necessary details of these. Meta-analysis is the method by which these individual trial results are then combined into an overall synthesis. Many aspects of their application to problems in health care are described by Egger, Davey-Smith and Higgins (2021).

A systematic review combines the evidence from the individual trials identified to give a more powerful analysis of any treatment effect. However, it is important to realise that a review can only be as good as its component parts and if the trials being reviewed are of poor quality then inferences drawn from an overview will have to be made with extreme caution. In contrast, if the basic information is of high quality then their collective and systematic synthesis clearly adds substantially to the evidence base for clinical medicine.
Although we will only give a brief introduction to the whole process, the principal stages of a systematic overview and meta-analysis are (i) the systematic identification of all trials that addressed the outcome of interest; (ii) the evaluation of the quality of these trials; (iii) the extraction of the relevant data; and (iv) the statistical combining and analysing of the collective results. A pivotal feature of the systematic review programme is the Cochrane Collaboration which is an international organisation dedicated to helping people to make well-informed decisions about health care. An invaluable guide to the whole process of conducting such reviews, and full details of the Cochrane Collaboration and Library can be found in Higgins, Thomas, Chandler, et al. (2019). Indeed, accessing this Library is one obvious starting point before embarking on a systematic review. An overview of the methodology pertinent to health-related quality of life (HRQoL) outcomes in clinical trials is given by Fayers and Machin (2016, Chapter 20).


20.7.1 The protocol

As with a clinical trial, an important part of the systematic overview process is to prepare a protocol outlining the procedures that are to be followed. This will mirror most of the components of a Trial Protocol that we outlined in Chapter 3. Shamseer, Moher, Clarke, et al. (2015) specify items that are required when constructing the protocol to be developed prior to a systematic review and meta-analysis being conducted. Thus, for example, the objectives, particular intervention types to be compared, eligible patient groups, and statistical methods all need to be specified. However, there are at least two sections that need to be added. One relates to the process of Literature Searching and the second to Assessing Quality of the trials so identified.

20.7.1.1 Literature search

A major part of any systematic review is the literature search, to make sure that all available trials have been identified and included. It is usually the literature search that takes the greatest amount of time. Searching should address published literature and unpublished trials that were opened but never completed, and trials still in progress. A starting point is to search bibliographic databases and obtain abstracts of all potentially relevant articles. A key point at this stage is to ensure that a full range of applicable terms is included in the search strategy. After a relevant publication has been found, the list of references it contains should be scrutinised for any citations of trials not yet identified. In most publications, clues about the pertinence of the citations may be obtained from the introduction and the discussion sections. This searching process refers to completed and published trials. Ongoing trials can often be identified by accessing registers of clinical trials.

Publication bias is a well-known problem in the reporting of clinical trials. Thus, journals are more likely to publish trials that obtain ‘interesting’ and positive results. This is especially the case if the trials are small, when those without such positive findings may fail to be published. Hence, other means have to be found to track such unpublished information down. A major justification for the compulsory registration of all clinical trials has been the countering of publication bias; so, it is becoming increasingly possible to establish at least the existence of unpublished and possibly negative trials.

20.7.1.2 Assessing quality

Full information, such as copies of publications, should be obtained for each potentially usable trial. These can be graded for eligibility and overall quality. Pointers to assessing the quality of trials are provided by Jüni, Altman and Egger (2001). It is usually recommended that each trial is reviewed by more than one person. This is partly to spread the workload, but mainly to ensure that the ratings are consistent and of a reliable standard. It is therefore important that there should be a formal and prespecified procedure both for making the ratings and for resolving rater disagreements.


20.7.2 Combining trial results

If there are k trials to be combined, then the process extends that described in Equation (20.4), which is confined to combining information from k = 2 sources. In general there will be weights W1, W2, …, Wk involved and each trial will provide an estimate of the treatment effect, say, d1, d2, …, dk. The latter replace, for example, bData and bExternal in Equation (20.4), and we denote the result of the calculation from pooling the trials by dOverview, which is the estimate of the true treatment effect, δ. Similarly, the standard error, which we denote by SE(dOverview), is derived by extending Equation (20.5). From these the 95% confidence interval for δ, which takes the same form as Equation (8.3), is

$$d_{Overview} - 1.96 \times SE(d_{Overview}) \ \text{to} \ d_{Overview} + 1.96 \times SE(d_{Overview}) \tag{20.10}$$
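A pooled estimate and the interval of Equation (20.10) might be computed as in the sketch below. It assumes that each weight Wi is the inverse of the squared standard error of di, which is the usual fixed-effect choice but is our assumption about the form of Equations (20.4) and (20.5). The three trial estimates are hypothetical and the function name `pool_trials` is our own.

```python
from math import sqrt


def pool_trials(estimates, standard_errors, z=1.96):
    """Inverse-variance weighted pooled estimate, its SE and 95% CI."""
    weights = [1.0 / se ** 2 for se in standard_errors]  # W_i (assumed form)
    total_weight = sum(weights)
    d_overview = sum(w * d for w, d in zip(weights, estimates)) / total_weight
    se_overview = sqrt(1.0 / total_weight)
    ci = (d_overview - z * se_overview, d_overview + z * se_overview)
    return d_overview, se_overview, ci


# Hypothetical treatment effects d1, d2, d3 and their standard errors.
d = [0.30, 0.10, 0.22]
se = [0.20, 0.10, 0.15]

d_overview, se_overview, (lower, upper) = pool_trials(d, se)
print(f"d_Overview = {d_overview:.3f}, SE(d_Overview) = {se_overview:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

The trial with the smallest standard error receives the largest weight, so the pooled estimate is drawn towards it, and the pooled standard error is smaller than that of any single trial.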

20.7.3 Forest plot

The standard way to present the results of a meta-analysis is a forest plot. This displays the point estimates for the separate trials and also for the overall effect, together with their associated confidence intervals. The weights for each trial are also shown graphically as a block. The area of the block and the confidence interval convey similar information, but both make different contributions to the graphic. The confidence intervals depict the range of treatment effects compatible with the results of each trial and indicate whether each was individually statistically significant. The size of the block draws the eye towards those trials with larger weight (narrower confidence intervals), which dominate the calculation of the pooled result. The pooled confidence interval is also shown, together with a lozenge indicating the value of the overall estimate.

Example 20.13 Ovarian cancer

The Advanced Ovarian Trialist Group (1991) sought to investigate the role of chemotherapy in advanced ovarian cancer on overall survival. To this end, they identified three single agent trials and eight combination therapy trials in which they were able to compare carboplatin with cisplatin. One of the trials was unpublished, and an important feature of this overview is that additional follow-up was obtained on the trial patients. So the synthesis was more up-to-date and was not merely a collation of the evidence that had appeared in the literature. One aspect of their overview is summarised in the forest plot of Figure 20.12 with a summary (unshaded) lozenge for the single agent carboplatin and cisplatin groups separately, together with an overall (shaded) lozenge for the two groups combined. In this case a Relative Risk (RR) > 1 favours cisplatin over carboplatin. The full results of the overview led to the launch of a new trial comparing cisplatin, doxorubicin and cyclophosphamide (CAP) with carboplatin alone.

                                     No of events / No entered
Reference                            Carboplatin    Cisplatin       O−E    Variance

Single agent:
Adams et al 41                         37 / 45        32 / 43       2.10      17.21
Mangioni et al 42                      54 / 88        47 / 85       4.00      25.21
Wiltshaw et al 43                      57 / 67        54 / 64       1.00      27.74
Subtotal                                                            7.10      70.17

Combination:
Alberts et al 44                      105 / 168      102 / 170     −1.00      51.73
Anderson et al 45                      14 / 27        22 / 29      −4.30       8.86
Conte et al 46                         55 / 83        49 / 82       1.60      25.95
Edmonson et al 47                      37 / 50        33 / 54       6.00      17.44
Kato et al 48                           4 / 28         5 / 23      −1.50       2.22
Meerpohl et al (unpublished ref (F))   31 / 91        21 / 82       3.90      12.80
Pater 49                              123 / 224      126 / 223     −1.00      62.25
ten Bokkel Huinink et al 50           106 / 167       97 / 168      5.30      50.72
Subtotal                                                            9.00     231.97

Total                                                              16.10     302.14

Test for heterogeneity: χ²(10) = 7.11; p = 0.715

[Forest plot: relative risk axis from 0.0 to 2.0; values below 1.0 are labelled 'Carboplatin better' and values above 1.0 'Cisplatin better']

Figure 20.12 Forest plot following an overview and meta-analysis of 11 randomised trials in ovarian cancer. Source: Advanced Ovarian Trialist Group. (1991). © 1991 BMJ Publishing Group Ltd.

20.7.4 Heterogeneity

The bottom left-hand corner of Figure 20.12 contains the following statement: ‘Test for heterogeneity: χ²(10) = 7.11; p = 0.715’ and this highlights one of the complications of meta-analysis. In brief, the methods we have referred to make the assumption that the true effect sizes for each trial are the same or homogeneous. In other words, it is assumed that if all trials had enrolled huge numbers of patients they would all have given the same effect size. In terms of forest plots, homogeneity implies that the confidence intervals of the trials should be largely overlapping. A visual inspection of the forest plot in this example shows that this is clearly the case. Homogeneity may not be a reasonable assumption, especially when trials have varying entry criteria, for example in age or severity of disease. Another frequent cause of heterogeneity is when the trials to be summarised apply treatments using varying dosage levels. This variation may result in corresponding variations in the levels of the response rates observed. In such cases, the association between the presumed factors (age, disease severity, or dosage) and the reported individual trial effect sizes can be investigated.

The presence of heterogeneity may be explored by calculating the statistic

$$Q = \sum_{j=1}^{k} W_j \times \left(d_j - d_{\text{Overview}}\right)^2 \qquad (20.11)$$
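Equation (20.11) can be checked numerically against the O−E and Variance columns of Figure 20.12. The sketch below again assumes that dj = (O−E)j/Variancej and Wj = Variancej, an assumption that is not stated in the figure itself, and refers Q to a χ² distribution with k − 1 degrees of freedom. To avoid any external library, the p-value uses the closed form of the χ² upper tail area that holds only when the degrees of freedom are even, which happens to be the case here (df = 10). All names in the code are illustrative.

```python
import math

# (O - E, Variance) for the 11 trials in Figure 20.12, in the order listed.
trials = [
    (2.10, 17.21), (4.00, 25.21), (1.00, 27.74),    # single agent
    (-1.00, 51.73), (-4.30, 8.86), (1.60, 25.95),   # combination
    (6.00, 17.44), (-1.50, 2.22), (3.90, 12.80),
    (-1.00, 62.25), (5.30, 50.72),
]


def chi2_upper_tail_even_df(x, df):
    """Upper tail area of a chi-squared distribution; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Assumed O - E approach: d_j = (O - E)_j / V_j with weight W_j = V_j.
estimates = [o_minus_e / variance for o_minus_e, variance in trials]
weights = [variance for _, variance in trials]
d_overview = sum(w * d for d, w in zip(estimates, weights)) / sum(weights)

# Equation (20.11).
q = sum(w * (d - d_overview) ** 2 for d, w in zip(estimates, weights))
df = len(trials) - 1
p_value = chi2_upper_tail_even_df(q, df)
print(f"Q = {q:.2f} on {df} df, p = {p_value:.3f}")
```

With these inputs the calculation agrees with the values Q = 7.11 and p = 0.715 quoted in Figure 20.12. For odd degrees of freedom a statistical library routine for the χ² distribution would be needed in place of the closed form.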

There is statistically significant heterogeneity if Q exceeds the critical value from a χ² distribution with degrees of freedom, df = k − 1, where k is the number of trials concerned. As indicated, in the case of Figure 20.12, Equation (20.11) gives Q = 7.11 and p-value = 0.715, suggesting no statistically significant departure from the assumption of homogeneity. On the other hand, had heterogeneity been established, then it would have been inappropriate to summarise the conclusions by the final (shaded) summary lozenge.

Systematic reviews are a considerable undertaking, but Shamseer, Moher, Clarke, et al. (2015) describe the preferred reporting items for systematic reviews and meta-analysis (PRISMA) aimed to assist investigators with this task. The items required are stipulated in the document with explanation and examples provided. There is also (their Table 2) a checklist of recommended items to address when preparing a systematic review protocol.

Example 20.14

Corneal astigmatism

Lake, Victor, Clare, et al. (2019), as part of a systematic review to assess the effects of toric intraocular lenses (IOLs) and limbal relaxing incisions (LRI) in the management of astigmatism during phacoemulsification cataract surgery, considered the five randomised controlled trials indicated in Figure 20.13. In this example, the outcome (presence or absence of astigmatism at 6 months) is a binary variable and the effect is summarised by the relative risk (RR), here termed the Risk Ratio. Thus, for example, the Freitas 2014 trial resulted in a RR = (28/30)/(20/32) = 1.49 which favoured IOLs. The weights are related to the size of the inverse of the variance of RR. The smaller the variance, the larger the weight. These determine the areas of the boxes on each of the illustrated confidence intervals.

                      Toric IOLs        LRI                      Risk Ratio             Risk of Bias
Study or Subgroup     Events  Total     Events  Total   Weight   IV, Random, 95% CI     A  B  C  D  E  F
Freitas 2014            28      30        20      32     31.7%   1.49 [1.12, 1.99]      +  ?  ?  ?  +  ?
Hirnschall 2014         15      28        11      28     13.5%   1.36 [0.77, 2.42]      +  +  +  +  +  +
Mingo-Botin 2010        18      20         8      20     14.2%   2.25 [1.29, 3.92]      ?  ?  ?  ?  ?  ?
Nanavaty 2017           25      34        25      36     30.6%   1.06 [0.79, 1.42]      +  +  ?  +  ?  ?
Titiyal 2014            10      17         7      17     10.0%   1.43 [0.71, 2.86]      +  ?  ?  ?  +  ?

Total (95% CI)                 129               133    100.0%   1.40 [1.10, 1.78]
Total events            96                71

Heterogeneity: Tau² = 0.03; Chi² = 6.33, df = 4 (P = 0.18); I² = 37%
Test for overall effect: Z = 2.73 (P = 0.006)

[Forest plot: risk ratio axis on a logarithmic scale from 0.1 to 10; values below 1 are labelled 'Favours LRIs' and values above 1 'Favours toric IOLs']

Risk of bias legend
(A) Random sequence generation (selection bias)
(B) Allocation concealment (selection bias)
(C) Blinding of participants and personnel (performance bias)
(D) Blinding of outcome assessment (detection bias)
(E) Incomplete outcome data (attrition bias)
(F) Selective reporting (reporting bias)

Figure 20.13 Postoperative residual refractive astigmatism at 6 months or more after surgery by Toric Intraocular Lens (IOLs) versus Limbal Relaxing Incisions (LRI) for corneal astigmatism after phacoemulsification. Source: Lake, Victor, Clare, et al. (2019). © 2019 John Wiley & Sons

Over the period since the Advanced Ovarian Trialist Group (1991) overview of Figure 20.12, thousands of systematic reviews have been conducted so that features such as the ‘Risk of Bias’ section of Figure 20.13 have now been added. These give indications, trial-by-trial, whether potential biases may impact on the reliability of the conclusions. In this example, Hirnschall 2014 appears to be bias free whilst there are some concerns with the Mingo-Botin 2010 trial. From their review, Lake, Victor, Clare, et al. (2019) conclude that: Toric IOLs probably provide a higher chance of achieving astigmatism within 0.5 D after cataract surgery compared with LRIs.
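The risk ratio and confidence interval quoted for an individual trial in Example 20.14 can be reproduced with a short calculation. The sketch below assumes the usual large-sample standard error of the log risk ratio, SE = √(1/a − 1/n1 + 1/c − 1/n2), where a and c are the numbers of events and n1 and n2 the numbers of patients in the two groups. This formula is not given in the example itself and the function name is hypothetical.

```python
import math


def risk_ratio(events_1, total_1, events_2, total_2, z=1.96):
    """Risk ratio of group 1 relative to group 2 with a large-sample 95% CI."""
    rr = (events_1 / total_1) / (events_2 / total_2)
    # Assumed large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / events_1 - 1 / total_1 + 1 / events_2 - 1 / total_2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Freitas 2014 in Figure 20.13: toric IOLs 28/30 versus LRI 20/32.
rr, lower, upper = risk_ratio(28, 30, 20, 32)
print(f"RR = {rr:.2f} [{lower:.2f}, {upper:.2f}]")
```

For the Freitas 2014 trial this agrees with the entry 1.49 [1.12, 1.99] in Figure 20.13. The percentage weights in that figure come from a random-effects analysis, so they will not be reproduced by the inverse of this variance alone.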

Statistical Tables

Table T1 Random numbers

75792 80169 94071 67970 91577 84334 03778 58563 29068 90047 54870 23327 03876 14846 94731 96046 95188 67416 50002 50806 43619 90476 43241 57434 15731 34706 16759 11895

78245 88847 63090 29162 43019 54827 05031 84810 74625 44763 35009 78957 89100 86619 63786 51589 25011 00626 97121 62492 79413 58785 22852 86821 12986 04386 74867 74173

83270 36686 23901 60224 67511 51955 90146 22446 90665 44534 84524 50987 66895 04238 88290 84509 29947 49781 26652 67131 45456 15177 28915 63717 03008 02945 62702 72423

59987 36601 93268 61042 28527 47256 59031 80149 52747 55425 32309 77876 89468 36182 60990 98162 48896 77833 23667 02610 31642 81377 49692 54640 18739 72555 32840 62838

75253 91654 53316 98324 61750 21387 96758 99676 09364 67170 88815 63960 96684 05294 98407 39162 83408 47073 13819 43964 78162 26671 75981 28782 07726 97249 08565 89382

42729 44249 87773 30425 55267 28456 57420 83102 57491 67937 86792 53986 95491 43791 43437 59469 79684 59147 54138 19528 81686 70548 74215 24046 75512 16798 18403 57437

98917 52586 89260 37677 07847 77296 23581 35381 59049 88962 89097 46771 32222 88149 74233 60563 11353 50469 54173 68333 73687 41383 65915 84755 65295 05643 10421 85314

83137 25702 04804 90382 50165 41283 38824 94030 19767 49992 66600 80998 58708 22637 25880 74917 13636 10807 69234 69484 19751 59773 36489 83021 15089 42343 60687 75320

67588 09575 99479 96230 26793 01482 49592 59560 83081 53583 26195 95229 34408 56775 96898 02413 46380 58985 28657 23527 24727 59835 10233 85436 81094 36106 68599 01988

93846 18939 83909 84565 80918 44494 18593 32145 78441 37864 88326 59606 66930 52091 52186 17967 69003 98881 01031 96974 98742 13719 89897 29813 05260 63948 78034 52518


Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.


Table T1 (Continued) 87597 63656 72414 69337 64310 31243 39951 57473 50259 48449 50830 94646 49344 49201 57221 65391 01029 23218 35175 28442 89327

21289 28328 71686 19016 62819 63913 83556 53613 80588 97696 52921 37630 07037 66377 54927 57289 99783 67476 51935 12188 26880

30904 25428 65513 50420 20242 66340 88718 76478 94408 14321 41365 50246 24221 64188 59025 67771 63250 45675 85800 99908 83020

13209 38671 81236 38803 08632 91169 68802 82668 55754 92549 46257 53925 41955 50398 46847 99160 39198 17299 91083 51660 20428

04244 97372 26205 55793 83905 28560 06170 28315 79166 95812 66889 95496 47211 33157 35894 08184 51042 85685 97112 34350 87554

53651 69256 10013 84035 49477 69220 90451 05975 20490 78371 29420 82773 43418 87375 14639 26262 36834 57294 20865 66572 33251

28373 49364 80610 93051 29409 14730 58926 96324 97112 77678 95250 41021 45703 55885 38452 46577 40450 30847 96101 43047 80684

90759 35398 40509 57693 96563 19752 50125 96135 25904 56618 24080 95435 78779 14174 89166 32603 90864 39985 83276 30217 01964

70286 30808 50045 33673 86993 51636 28532 14255 20254 44769 08600 83812 77215 03105 72843 21677 49953 44402 84149 44491 04106

49678 59082 70530 67434 91207 59434 17189 29991 08781 57413 04189 52558 44594 85821 40954 54104 61032 76665 11443 79042 28243

Each digit 0–9 is equally likely to appear and cannot be predicted from any combination of other digits.

Figure T1 The probability density function of a standardised Normal distribution tabulated in Table T2

[Figure: probability density (0 to 0.4) plotted against the standardised variable, z (−4 to 4), with the areas under the curve labelled γ and 1 − γ]

Table T2 The Normal distribution function – Probability, γ, that a Normally distributed standardised variable is less than zγ z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.50000 0.53983 0.57926 0.61791 0.65542 0.69146 0.72575 0.75804 0.78814 0.81594 0.84134 0.86433 0.88493 0.90320 0.91924 0.93319 0.94520 0.95543 0.96407 0.97128

0.50399 0.54380 0.58317 0.62172 0.65910 0.69497 0.72907 0.76115 0.79103 0.81859 0.84375 0.86650 0.88686 0.90490 0.92073 0.93448 0.94630 0.95637 0.96485 0.97193

0.50798 0.54776 0.58706 0.62552 0.66276 0.69847 0.73237 0.76424 0.79389 0.82121 0.84614 0.86864 0.88877 0.90658 0.92220 0.93574 0.94738 0.95728 0.96562 0.97257

0.51197 0.55172 0.59095 0.62930 0.66640 0.70194 0.73565 0.76730 0.79673 0.82381 0.84849 0.87076 0.89065 0.90824 0.92364 0.93699 0.94845 0.95818 0.96638 0.97320

0.51595 0.55567 0.59483 0.63307 0.67003 0.70540 0.73891 0.77035 0.79955 0.82639 0.85083 0.87286 0.89251 0.90988 0.92507 0.93822 0.94950 0.95907 0.96712 0.97381

0.51994 0.55962 0.59871 0.63683 0.67364 0.70884 0.74215 0.77337 0.80234 0.82894 0.85314 0.87493 0.89435 0.91149 0.92647 0.93943 0.95053 0.95994 0.96784 0.97441

0.52392 0.56356 0.60257 0.64058 0.67724 0.71226 0.74537 0.77637 0.80511 0.83147 0.85543 0.87698 0.89617 0.91308 0.92785 0.94062 0.95154 0.96080 0.96856 0.97500

0.52790 0.56749 0.60642 0.64431 0.68082 0.71566 0.74857 0.77935 0.80785 0.83398 0.85769 0.87900 0.89796 0.91466 0.92922 0.94179 0.95254 0.96164 0.96926 0.97558

0.53188 0.57142 0.61026 0.64803 0.68439 0.71904 0.75175 0.78230 0.81057 0.83646 0.85993 0.88100 0.89973 0.91621 0.93056 0.94295 0.95352 0.96246 0.96995 0.97615

0.53586 0.57535 0.61409 0.65173 0.68793 0.72240 0.75490 0.78524 0.81327 0.83891 0.86214 0.88298 0.90147 0.91774 0.93189 0.94408 0.95449 0.96327 0.97062 0.97670 (Continued )

Table T2 (Continued) z 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.97725 0.98214 0.98610 0.98928 0.99180 0.99379 0.99534 0.99653 0.99744 0.99813 0.99865 0.99903 0.99931 0.99952 0.99966 0.99977 0.99984 0.99989 0.99993 0.99995

0.97778 0.98257 0.98645 0.98956 0.99202 0.99396 0.99547 0.99664 0.99752 0.99819 0.99869 0.99906 0.99934 0.99953 0.99968 0.99978 0.99985 0.99990 0.99993 0.99995

0.97831 0.98300 0.98679 0.98983 0.99224 0.99413 0.99560 0.99674 0.99760 0.99825 0.99874 0.99910 0.99936 0.99955 0.99969 0.99978 0.99985 0.99990 0.99993 0.99996

0.97882 0.98341 0.98713 0.99010 0.99245 0.99430 0.99573 0.99683 0.99767 0.99831 0.99878 0.99913 0.99938 0.99957 0.99970 0.99979 0.99986 0.99990 0.99994 0.99996

0.97932 0.98382 0.98745 0.99036 0.99266 0.99446 0.99585 0.99693 0.99774 0.99836 0.99882 0.99916 0.99940 0.99958 0.99971 0.99980 0.99986 0.99991 0.99994 0.99996

0.97982 0.98422 0.98778 0.99061 0.99286 0.99461 0.99598 0.99702 0.99781 0.99841 0.99886 0.99918 0.99942 0.99960 0.99972 0.99981 0.99987 0.99991 0.99994 0.99996

0.98030 0.98461 0.98809 0.99086 0.99305 0.99477 0.99609 0.99711 0.99788 0.99846 0.99889 0.99921 0.99944 0.99961 0.99973 0.99981 0.99987 0.99992 0.99994 0.99996

0.98077 0.98500 0.98840 0.99111 0.99324 0.99492 0.99621 0.99720 0.99795 0.99851 0.99893 0.99924 0.99946 0.99962 0.99974 0.99982 0.99988 0.99992 0.99995 0.99996

0.98124 0.98537 0.98870 0.99134 0.99343 0.99506 0.99632 0.99728 0.99801 0.99856 0.99896 0.99926 0.99948 0.99964 0.99975 0.99983 0.99988 0.99992 0.99995 0.99997

0.98169 0.98574 0.98899 0.99158 0.99361 0.99520 0.99643 0.99736 0.99807 0.99861 0.99900 0.99929 0.99950 0.99965 0.99976 0.99983 0.99989 0.99992 0.99995 0.99997
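Entries of Table T2 can be checked, or values of z outside the tabulated range obtained, with the Python standard library. The sketch below uses statistics.NormalDist (available from Python 3.8); the helper name normal_probability is illustrative only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def normal_probability(z):
    """Probability, gamma, that a standardised Normal variable is less than z."""
    return standard_normal.cdf(z)


for z in (0.0, 1.0, 1.96):
    print(f"z = {z:.2f}: gamma = {normal_probability(z):.5f}")
```

The value for z = 1.96 corresponds to the row z = 1.9 and column 0.06 of Table T2.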

Table T3 Percentage points of the standardised Normal distribution for given α and 1 − β (some frequently used entries are highlighted)

1-sided: α    2-sided: α    1-sided: 1 − β    z
0.0005        0.001         0.9995            3.2905
0.00135       0.0027        0.99865           3.0000
0.0025        0.005         0.9975            2.8070
0.005         0.01          0.995             2.5758
0.01          0.02          0.99              2.3263
0.0125        0.025         0.9875            2.2414
0.025         0.05          0.975             1.9600
0.05          0.1           0.95              1.6449
0.1           0.2           0.9               1.2816
0.15          0.3           0.85              1.0364
0.2           0.4           0.8               0.8416
0.25          0.5           0.75              0.6745
0.3           0.6           0.7               0.5244
0.35          0.7           0.65              0.3853
0.4           0.8           0.6               0.2533
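The percentage points of Table T3 are the inverse of the distribution function in Table T2, so they too can be obtained from the Python standard library. The sketch below uses statistics.NormalDist.inv_cdf; the two helper names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()


def z_for_two_sided_alpha(alpha):
    """z such that a total area alpha lies in the two tails beyond -z and z."""
    return standard_normal.inv_cdf(1 - alpha / 2)


def z_for_power(power):
    """z such that the area below z equals the power, 1 - beta."""
    return standard_normal.inv_cdf(power)


print(f"2-sided alpha = 0.05: z = {z_for_two_sided_alpha(0.05):.4f}")
print(f"1 - beta = 0.90:      z = {z_for_power(0.90):.4f}")
print(f"1 - beta = 0.80:      z = {z_for_power(0.80):.4f}")
```

These three calls correspond to the rows of Table T3 with z = 1.9600, 1.2816 and 0.8416, the values most often required in sample size calculations.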

Figure T2 The probability density function of the Student’s t-distribution tabulated in Table T4

[Figure: probability density (0 to 0.4) plotted against the standardised variable, t, with central area 1 − α between −t1−α/2 and t1−α/2 and an area of α/2 in each tail]

Table T4 The value tabulated is t1 − α/2, such that if X is distributed as Student’s t-distribution with df degrees of freedom, then α is the probability that X ≤ −t1 − α/2 or X ≥ t1 − α/2 α df

0.20

0.10

0.05

0.04

0.03

0.02

0.01

0.005

0.001

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.383 1.372 1.363 1.356 1.350 1.345 1.341 1.337 1.333 1.330 1.328 1.325 1.323 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 1.309 1.309 1.308 1.307 1.306

6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.696 1.694 1.692 1.691 1.690

12.706 4.303 3.182 2.776 2.571 2.447 2.365 2.306 2.262 2.228 2.201 2.179 2.160 2.145 2.131 2.120 2.110 2.101 2.093 2.086 2.080 2.074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 2.040 2.037 2.035 2.032 2.030

15.895 4.849 3.482 2.999 2.757 2.612 2.517 2.449 2.398 2.359 2.328 2.303 2.282 2.264 2.249 2.235 2.224 2.214 2.205 2.197 2.189 2.183 2.177 2.172 2.167 2.162 2.158 2.154 2.150 2.147 2.144 2.141 2.138 2.136 2.133

21.205 5.643 3.896 3.298 3.003 2.829 2.715 2.634 2.574 2.527 2.491 2.461 2.436 2.415 2.397 2.382 2.368 2.356 2.346 2.336 2.328 2.320 2.313 2.307 2.301 2.296 2.291 2.286 2.282 2.278 2.275 2.271 2.268 2.265 2.262

31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.453 2.449 2.445 2.441 2.438

63.657 9.925 5.841 4.604 4.032 3.707 3.499 3.355 3.250 3.169 3.106 3.055 3.012 2.977 2.947 2.921 2.898 2.878 2.861 2.845 2.831 2.819 2.807 2.797 2.787 2.779 2.771 2.763 2.756 2.750 2.744 2.738 2.733 2.728 2.724

127.321 14.089 7.453 5.598 4.773 4.317 4.029 3.833 3.690 3.581 3.497 3.428 3.372 3.326 3.286 3.252 3.222 3.197 3.174 3.153 3.135 3.119 3.104 3.091 3.078 3.067 3.057 3.047 3.038 3.030 3.022 3.015 3.008 3.002 2.996

636.619 31.599 12.924 8.610 6.869 5.959 5.408 5.041 4.781 4.587 4.437 4.318 4.221 4.140 4.073 4.015 3.965 3.922 3.883 3.850 3.819 3.792 3.768 3.745 3.725 3.707 3.690 3.674 3.659 3.646 3.633 3.622 3.611 3.601 3.591 (Continued )

Table T4 (Continued) α

df

0.20

0.10

0.05

0.04

0.03

0.02

0.01

36 37 38 39 40 ∞

1.306 1.305 1.304 1.304 1.303 1.290

1.688 1.687 1.686 1.685 1.684 1.660

2.028 2.026 2.024 2.023 2.021 1.984

2.131 2.129 2.127 2.125 2.123 2.081

2.260 2.257 2.255 2.252 2.250 2.201

2.434 2.431 2.429 2.426 2.423 2.364

2.719 2.715 2.712 2.708 2.704 2.626

0.005 2.990 2.985 2.980 2.976 2.971 2.871

0.001 3.582 3.574 3.566 3.558 3.551 3.390

Table T5 Simon, Wittes and Ellenberg design

δPlan = 0.1, PCS = 0.9

Smallest response rate,     Number of treatments, g
π Small                       2      3      4      5      6
0.10                         42     62     74     83     90
0.15                         53     79     95    106    115
0.20                         62     93    111    125    136
0.25                         69    104    125    141    153
0.30                         75    113    136    153    166
0.35                         79    119    144    162    175
0.40                         82    123    149    167    181
0.45                         82    124    150    169    183
0.50                         82    123    149    167    182

δPlan = 0.15, PCS = 0.9

Smallest response rate,     Number of treatments, g
π Small                       2      3      4      5      6
0.10                         21     31     37     41     45
0.15                         26     38     45     51     55
0.20                         29     44     52     59     64
0.25                         32     48     58     65     71
0.30                         35     52     62     70     76
0.35                         36     54     65     73     79
0.40                         37     55     67     75     81
0.45                         37     55     67     75     81
0.50                         36     54     65     73     80

For a probability of correctly selecting the best treatment PCS = 0.9, the table gives the number of patients n in each group required to identify the best of g treatments under investigation as compared to the least effective with anticipated response rate π Small and difference in response rate over that specified as δPlan.

Table T6 The multiplier κ of the estimated standard deviation, sPilot, to obtain the γ upper confidence limit (UCL) estimate, sInflated, for the eventual standard deviation, σ Plan, for estimating the sample size of the Main trial (Equation 19.2)

Degrees of                                    Degrees of
freedom df   γ = 0.8   γ = 0.9   γ = 0.95     freedom df   γ = 0.8   γ = 0.9   γ = 0.95
  1          3.947     7.958     15.947         21         1.166     1.259     1.346
  2          2.117     3.081      4.415         22         1.161     1.252     1.335
  3          1.728     2.266      2.920         23         1.157     1.245     1.326
  4          1.558     1.939      2.372         24         1.153     1.238     1.316
  5          1.461     1.762      2.089         25         1.149     1.232     1.308
  6          1.398     1.650      1.915         26         1.145     1.226     1.300
  7          1.353     1.572      1.797         27         1.142     1.221     1.293
  8          1.320     1.514      1.711         28         1.139     1.216     1.286
  9          1.293     1.469      1.645         29         1.136     1.211     1.280
 10          1.272     1.434      1.593         30         1.133     1.207     1.274
 11          1.255     1.404      1.346         31         1.131     1.203     1.268
 12          1.240     1.380      1.335         32         1.128     1.199     1.263
 13          1.227     1.359      1.326         33         1.126     1.195     1.258
 14          1.216     1.341      1.316         34         1.123     1.191     1.253
 15          1.206     1.325      1.308         35         1.121     1.188     1.248
 16          1.198     1.311      1.300         40         1.112     1.173     1.228
 17          1.190     1.298      1.293         50         1.098     1.152     1.199
 18          1.183     1.287      1.286         60         1.088     1.136     1.179
 19          1.177     1.277      1.280         80         1.075     1.116     1.151
 20          1.171     1.268      1.274        100         1.066     1.102     1.133

Table T7 Total sample size for a two equal group Main trial for standardised difference ΔPlan/sPilot assuming 2-sided α = 0.05 and power 1 – β = 0.9

              Standardised effect size, ΔPlan/sPilot
dfPilot       0.25    0.3    0.4    0.5    0.6    0.7   0.75    0.8    0.9      1
10             878    610    244    222    154    114    100     88     70     58
15             802    558    316    202    142    104     92     80     64     52
20             768    534    302    194    136    100     88     78     62     50
25             748    520    294    190    132     98     86     76     60     50
30             736    512    288    186    130     96     84     74     60     48
35             726    506    286    184    128     94     84     74     58     48
40             720    500    282    182    128     94     82     72     58     48
45             714    498    280    180    126     94     82     72     58     48
50             710    494    280    180    126     92     82     72     58     46
Very large     676    470    266    172    120     88     78     68     54     44

Source: Based on Julious SA and Owen RJ (2006).

Table T8 Optimal values of Pilot, Main and hence Overall two equal-group comparative trial size for UCL80%, UCL95% and NCT inflation methods for a given standardised difference, Δ (from a combined iterative solution of Equations (19.3) and (19.4) to minimise (19.5))

Standardised     Inflation method
Difference       UCL: γ = 0.8 (80%)       UCL: γ = 0.95 (95%)      NCT
ΔPlan =
δPlan/σ Plan     Pilot   Main  Overall    Pilot   Main  Overall    Pilot   Main  Overall

Main trial power: 1 − β = 0.8 (80%)
0.20               78     914     992      122     986    1108       40     824     864
0.30               48     426     474       76     468     544       28     376     404
0.40               36     250     286       56     278     334       22     216     238
0.50               28     166     194       44     188     232       20     140     160
0.60               24     120     144       36     138     174       20      98     118
0.70               20      90     110       32     106     138       20      72      92
0.80               20      70      90       28      84     112       20      56      76
0.90               20      58      74       26      70      96       20      44      64
1.00               20      22      64       22      58      80       20      36      56

Main trial power: 1 − β = 0.90 (90%)
0.20               92    1206    1298      144    1294    1438       56    1104    1160
0.30               58     558     616       90     610     700       38     504     542
0.40               42     326     368       66     362     428       30     290     320
0.50               32     216     248       52     244     296       24     190     214
0.60               28     156     184       42     178     220       22     136     158
0.70               24     118     142       36     136     172       20     100     120
0.80               20      92     112       32     108     140       20      78      98
0.90               20      76      94       28      88     116       20      62      82
1.00               20      60      80       26      74     100       20      50      70

Source: Based on Whitehead, Julious, Cooper, et al. (2015).

Glossary

The majority of the definitions within this glossary are based on, but are only a selection from, the comprehensive list provided by S Day (2007) Dictionary of Clinical Trials. 2nd edition, Wiley, Chichester. In respect to some of the definitions included here, we add an explanatory comment while some are also amplified in appropriate sections within the main text. adaptive design trial procedures that change as the trial progresses. An example is that of the randomisation process changing as the trial progresses and the results become known. Such designs are used so that, if it appears that one intervention is emerging as superior to another, the allocation ratio can be biased in favour of the intervention which seems to be best. adaptive threshold design (ATD) a trial in which a potential biomarker is insufficiently understood and for which the groups defining those (if any) who will, and those who will not, benefit from the treatment under investigation are to be established. adjust to modify the treatment effect (the adjusted estimate) to account for differences in patient characteristics (usually only when known to be importantly prognostic for outcome) between intervention groups. allocation ratio in a parallel group trial the ratio of the number of subjects allocated to one intervention group relative to the number allocated to another. Most often the ratio is 1 : 1, or equal allocation. arm synonym for group (as in randomised group). assigned intervention the intervention that a trial participant is due to receive based on a randomisation procedure. attrition loss; often used to describe loss of patients’ data in long-term trials due to patients withdrawing for reasons other than those of meeting the trials’ primary endpoint. audit trail a list of reasons and justifications for all changes that are made to data or documents, and of all procedures that do not comply with agreed trial procedures. 
Such an audit trail particularly applies to information that is entered onto the trial database so that if an item is amended (including deletion) in any way then the original entry is retained, the amendment noted and the date and individual responsible for the change also noted and automatically stored within the database. auto-correlation correlation between repeated measurements taken successively in time on the same subject.


baseline characteristic a measurement taken on a subject at the beginning of a trial. Note that ‘beginning’ is generally taken to be at, or as near as possible to (ideally before), the time of randomisation. Bayesian general statistical methods based around Bayes’ theorem which describes a way of moving the thinking about the probability of data, given a hypothesis, to the probability of a hypothesis being true or false. posterior distribution in Bayesian statistics, this is the probability distribution of the treatment effect after combining the prior distribution of the treatment effect with that of the trial data itself. prior distribution in Bayesian statistics, this is the probability distribution of the treatment effect (usually obtained before the trial commences) before combining it with the trial data to obtain the posterior distribution. bias a process which systematically overestimates or underestimates a parameter. publication bias the situation where there is a tendency for positive trials to be more widely published (or otherwise reported) than negative trials. This is a particular problem and can cause bias in carrying out overviews and meta-analyses. recall bias any bias that arises from remembering events. response bias any bias that is caused by a systematic difference between those people who respond (typically to a questionnaire) and those who do not respond. selection bias the bias caused by the fact that the types of participants who take part in trials are not a random sample of the population from which they are drawn. biomarker an objective measurement (often a laboratory measurement) that is an indicator (or ‘marker’) of a clinical condition or of a response to treatment. biomarker stratified design a design in which patients with different biomarker indications are allocated to treatment using stratified randomisation according to their biomarker status. 
blind or mask where the investigator, subject (and possibly other people) are not able to distinguish between the treatments that are being compared (by sight, smell, taste, weight, etc). single blind the case when only one party is blind to the treatment allocation. It is usually either the investigator or the subject who is blind but the term on its own does not differentiate. double blind a trial where the subjects and investigators are blind to the treatment or intervention allocation. triple blind the situation where the subject is blinded to the trial medication, and the investigator and the data management staff are also blinded. quadruple blind subjects are blind to what medication they receive; those administering the treatments are blind to what treatment they are giving each subject, investigators (or those assessing efficacy and safety) are blind to which treatment was given, and data management and statistical personnel are also blind to which treatment each subject received. The only use over that of triple blind is when the person who gives the treatment to the subject is not the same as the one who subsequently assesses the effect of the treatment. It should be noted that these definitions are not consistently adhered to in articles describing randomised trials so that the reader needs to check precisely what is intended by the author’s use of these terms. block randomisation a randomisation scheme that uses blocks, each block usually having the same number of treatments (although in random order) as other blocks. Commonly, if a trial is comparing two treatments, each block might comprise four patients, two on one treatment and two on the other. In general, block size will depend on the number of interventions within the trial and the corresponding allocation ratio.
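The definition of block randomisation can be illustrated by a short sketch. It assumes two treatments, labelled A and B for illustration, equal allocation and blocks of four patients, as in the example given in the definition. The function name, its parameters and the use of a seeded pseudo-random generator are all assumptions made for this illustration and do not describe any particular trial system.

```python
import random


def block_randomisation(n_blocks, treatments=("A", "B"), block_size=4, seed=None):
    """Return an allocation list built from randomly ordered, balanced blocks."""
    if block_size % len(treatments):
        raise ValueError("block size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    per_treatment = block_size // len(treatments)
    allocations = []
    for _ in range(n_blocks):
        # Each block holds every treatment equally often, in random order.
        block = list(treatments) * per_treatment
        rng.shuffle(block)
        allocations.extend(block)
    return allocations


allocation = block_randomisation(n_blocks=3, seed=2021)
print(allocation)
```

After every complete block the numbers allocated to A and B are equal, which is the purpose of blocking. An unequal allocation ratio would require each block to contain the treatments in the corresponding proportions.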


Bonferroni correction an adjustment made when interpreting multiple tests of statistical significance that all address a similar basic question. If two endpoints have been assessed separately, instead of considering whether a p-value is less than (or greater than) 0.05, the calculated p-value would be compared to 0.025. In general, if k p-values have been calculated, the declaration of statistical significance would not be made unless one or more of those p-values is less than 0.05/k. An equivalent approach, and one more readily understandable, is to multiply each calculated p-value by k and only declare those adjusted values less than 0.05 as statistically significant. censored observation when the time until an event (typically cure, recurrence of symptoms or death) is the data value to be recorded and that event has not yet been observed for a particular subject, then the data value is said to be censored. cluster design a design in which individual subjects are not randomised to receive different interventions, but rather groups (or ‘clusters’) of subjects are randomised. Examples are most common in community intervention trials. confidence interval (CI) a range of values for a parameter (such as the difference in means or proportions between two treatment groups) that are all consistent with the observed data. The width of such an interval can vary, depending on how confident we wish to be that the range quoted will truly encompass the value of the parameter. Usually ‘95% confidence intervals’ are quoted. These intervals will, in 95% of repeated cases, include the true value of the parameter of interest. In this case, the confidence level is said to be 95% (or 0.95). confounded ‘cannot be distinguished from’. For example, if all males were given one treatment and all females given an alternative, the effects of treatment and gender would be indistinguishable from one another, or confounded with each other.
Similarly, in a stepped wedge design, period and time are confounded. partially confounded the situation where two variables are not completely confounded. In a stepped wedge design, period and arm are partially confounded. consistency check an edit check on data to ensure that two (or more) data values could happen in conjunction. Systolic blood pressure measurements must always be at least as great as diastolic measurements so, for any given patient, if the systolic pressure is greater than the diastolic, then the two measures are consistent with each other. It may be that neither is correct – but they are, at least, consistent. CONSORT a set of guidelines, adopted by many leading medical journals, describing the way in which clinical trials should be reported. It stands for Consolidated Standards of Reporting Trials. Details can be found at http://www.consort-statement.org. covariate a variable that is not of primary interest but which may affect response to treatment. Common examples are subjects’ demographic data and baseline assessments of disease severity. cross-over design a trial where each subject receives (in a random sequence) each trial medication. After receiving Treatment A, they are ‘crossed over’ to receive Treatment B (or vice versa). This is the simplest form of cross-over design and is called the two-period cross-over design. carryover a term used mostly in the context of cross-over trials where the effect of the drug is still present after that drug has ceased to be given to a subject, and in particular when that subject is taking another drug. period the intervals of time when a subject is given the first treatment (period 1) and when they are given the second treatment (period 2). wash-out the process of allowing time for drugs to be naturally excreted from the body.
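The two equivalent forms of the Bonferroni correction described above can be sketched as follows. The p-values are invented purely for illustration, the function name is hypothetical, and capping the adjusted values at 1 is a common convention that is assumed here rather than taken from the definition.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and the corresponding decisions."""
    k = len(p_values)
    # Multiply each p-value by k; cap at 1 so the result remains a probability.
    adjusted = [min(1.0, p * k) for p in p_values]
    # Equivalent to comparing each unadjusted p-value with alpha / k.
    significant = [p < alpha / k for p in p_values]
    return adjusted, significant


# Hypothetical p-values from k = 3 endpoints.
adjusted, significant = bonferroni([0.01, 0.03, 0.20])
print(adjusted)
print(significant)
```

Here only the first endpoint would be declared statistically significant, since 0.01 is below 0.05/3 whereas 0.03 is not.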


data and safety monitoring board (DSMB) a group of people who regularly review the accumulating data in a trial with the possibility of stopping the trial or modifying its progress. A trial may be stopped, or changes made to it, if clear evidence of efficacy is seen or if adverse safety is observed in one or more of the intervention groups, or for futility.
diagnostic test a test (physical, mental or, more commonly, chemical or biological) that is used to assess whether or not a subject has a particular disease biomarker.
dropout or withdrawal the case where a subject stops participating in a trial before he or she is due according to the trial protocol.
effect a relative measure such as the extra change in blood pressure produced by one treatment over that produced by the comparator treatment.
effect size or standardised effect size strictly this should simply be the size of an effect but conventionally it is taken to be the size of the effect divided by the standard deviation of the measurement. Thus, an effect size of 0.5 indicates a difference between two means equal to half of a standard deviation.
endpoint a variable that is one of the primary interests in the trial. The variable may relate to efficacy or safety. The term is used almost synonymously with efficacy variable or safety variable.
composite endpoint an overall endpoint in a trial made up of more than one component. An example might be the endpoint of ‘all-cause mortality or myocardial infarction’.
intermediate endpoint an endpoint that does not measure exactly what we want to know but which is a second-best alternative.
primary endpoint the most important endpoint in a trial.
secondary endpoint one of (possibly several) less important endpoints in a trial.
surrogate endpoint a substitute endpoint. A variable that is a substitute for the most clinically meaningful endpoint. In hypertension trials, the most important endpoint would usually be mortality (possibly restricted to cardiovascular reasons) but raised blood pressure would often be the endpoint that is measured. As such, blood pressure is being used as a surrogate for mortality.
enrichment design a design in which patients are not included in the trial when there is compelling evidence, from a particular biomarker, that they will not benefit from the treatment under investigation.
equipoise the state of having an indifferent opinion about the relative merits of two (or more) alternative treatments. Ethically, a subject should only be randomised into a trial if the treating physician has no clear evidence that one treatment is superior to another. If such evidence does exist then it is considered unethical to randomly assign a treatment. If the physician is in a state of equipoise, then randomisation is considered ethical.
equivalence trial a trial whose primary aim is to demonstrate that interventions are equivalent with regard to certain specified parameters.
equivalence margin the difference between test treatment and control treatment (in terms of the primary endpoint) that is deemed to be of no clinical importance. Hence, if a new treatment were this much better, or worse, than an old treatment, then the new and old treatments would be considered, for all practical purposes, to be just as efficacious.
exclusion criteria reasons why a subject should not be enrolled into a trial. These are usually reasons of safety and should not simply be the opposites of inclusion criteria.
factor another name for a categorical variable, usually one that is a covariate or a stratification variable, rather than one that is an outcome variable.
factor level one of the different values that a factor can take. For example, a placebo may be the zero dose (level) of a drug under test in a trial.
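The effect size entry above describes the standardised effect size as the difference between two means divided by the standard deviation. A minimal sketch follows, assuming a pooled within-group standard deviation as the divisor (the entry does not say which standard deviation to use, so this choice is an assumption). The function name and the data are invented for illustration.

```python
import statistics


def standardised_effect_size(group_a, group_b):
    """Difference in means divided by the pooled standard deviation.

    Pooling the two sample variances is an assumption made for this
    sketch; other choices of standard deviation are possible.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_difference = statistics.mean(group_a) - statistics.mean(group_b)
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return mean_difference / pooled_variance ** 0.5


# Invented measurements: the means differ by 1 and each group has SD 2.
treatment = [10, 12, 14]
control = [9, 11, 13]
print(standardised_effect_size(treatment, control))  # 0.5
```

In this example the means differ by half a standard deviation, which matches the interpretation of an effect size of 0.5 given in the entry.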


factorial design a trial that compares two (or more) different sets of interventions (factors). The simplest design uses Drug A versus Placebo A and Drug B versus Placebo B. Subjects will be randomised to one of four groups: Placebo A + Placebo B, Drug A + Placebo B, Placebo A + Drug B or Drug A + Drug B. This is a very efficient type of design because it not only allows the assessments of Drug A and Drug B in one trial instead of two but also allows us to investigate the question of whether drugs A and B show any interaction.
incomplete factorial design a factorial design where not all combinations of the possible treatments are used.
interaction effect the difference in the size of the effect caused by two or more factors (treatment types) jointly, compared with the sum of the individual effects of each.
main effect in factorial trials, the main effect of one factor is the size of the effect averaged over all levels of all other factors.
feasibility study a small study for helping to design a subsequent trial. The main uses of feasibility studies are to test practical arrangements (for example, how long do various activities take? Is it possible to do all the things we want to?) and to test questionnaires (do the subjects understand the questions in the way intended?). See also pilot study.
fixed sample size design a design that determines the number of subjects to be recruited before the trial starts and does not allow the number to be changed. This is the most common type of approach to determining how many subjects should be in a trial.
forest plot a diagram comprising the individual estimated treatment effects and associated 95% confidence intervals for each trial included in a meta-analysis, together with an overall estimate and confidence interval.
futility analysis usually an interim analysis of a trial to determine whether the trial objectives are likely to be achieved. Continuation of a trial may be considered futile if there is little chance of detecting a clinically useful treatment effect.
genomic medicine the study of how genetic variations can influence disease.
Good Clinical Practice (GCP) a set of principles and guidelines to ensure high quality and high standards in clinical trials research.
hypothesis a statement for which good evidence may not exist but which is to be the subject of a clinical trial.
alternative hypothesis (H1) this is usually the point of interest in a trial. It is generally phrased in terms of the null hypothesis (of no treatment effect) not being true. If the objective of a trial is to ‘compare Drug A with placebo’ then the null hypothesis would be that there is no difference between the two treatments and the alternative hypothesis would be that there is a difference.
null hypothesis (H0) the assumption, generally made in statistical significance testing, that there is no difference between groups (in whatever feature is being compared). Evidence (in the form of trial data) is then sought to refute this null hypothesis.
inclusion criteria the requirements that a subject must fulfil to be allowed to enter a trial. These are usually selected to ensure that the subject has the appropriate disease and that he or she is the type of subject that the researchers wish to study. Inclusion criteria should not simply be the opposites of the exclusion criteria.
informed consent the practice of explaining to subjects and informing them about the purpose of the trial and seeking their agreement to participate on a voluntary basis.
e-consent the use of the Internet (possibly in face-to-face format) to inform and obtain consent from the subject.


intention-to-treat (ITT) a strategy for analysing trial data under which any subject randomised to treatment must be included in the analysis under the arm assigned at randomisation.
interim analysis a formal statistical term indicating an analysis of data part way through a trial.
International Council on Harmonisation (ICH) a group of industry and regulatory representatives from Europe, Japan and the USA who draw up common standards of required tests and documentation for drug licensing in these three regions of the world. The web site for ICH is http://www.ich.org.
large simple trial a trial that enrols many patients (perhaps 20 000, or more) and whose procedures and documentation are kept as simple (and minimal) as possible.
meta-analysis an analysis of the results from two or more similar trials. Such methods are used as a way of synthesising data from a variety of trials to try to get better answers to specific medical questions.
missing data or forms a data value or form that should have been recorded or completed but for some reason was not.
informative missing the situation where missing data tell us something about the effect of treatment. Usually the ‘informative’ nature is detrimental to the treatment (for example, efficacy data missing because a patient has died) but sometimes the information can be more positive (for example, no further follow-up made as the patient was cured).
missing at random missing data where the probability of data being missing may depend on the values of some other measured data but does not depend on the missing values themselves.
model an idealised description of a real (often uncertain) situation.
additive model a statistical model where the combined effect of separate variables contributes as the sum of each of their separate effects.
statistical model an equation including both deterministic and random elements which describes how a process behaves. Examples are regression models and logistic regression. In fact, simple t-tests are based on simple models.
noninferiority trial a trial whose objective is to show that one intervention ‘is not worse than another’. This is subtly different from an equivalence trial, which aims to show that two treatments are equivalent, and is obviously different from trying to show that one treatment is better than another (superiority trial).
number needed to treat (NNT) the number of patients that a physician would have to treat with a new treatment in order to avoid one event that would otherwise have occurred with a standard treatment.
open trial a trial where the treatments are not blinded.
overview to look at trial or trial related data from various sources, considering them as a whole and drawing a conclusion.
paired design a trial design that involves taking paired observations and usually makes treatment comparisons using paired comparisons, often in the form of cross-over or split-mouth designs.
parallel group design the most common design for clinical trials, whereby subjects are allocated to receive one of several interventions. All subjects are independently allocated to one of the treatment groups. No subject receives more than one of the treatments.
parameter the true (but often unknown) value of some characteristic. The most common parameter that we wish to estimate in clinical trials is the size of the treatment effect.
per protocol analysis the analysis of trial data that includes only those subjects from all those randomised who adequately comply with the trial protocol.
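The number needed to treat (NNT) entry above can be made concrete with a short sketch. It assumes the usual calculation of NNT as the reciprocal of the absolute risk reduction, rounded up to a whole patient; neither the formula nor the rounding convention is stated in the entry, so both are assumptions here. The function name and the event rates are invented.

```python
import math


def number_needed_to_treat(event_rate_standard, event_rate_new):
    """NNT as the reciprocal of the absolute risk reduction, rounded up.

    Both rates are proportions between 0 and 1. The new treatment must
    have the lower event rate for the NNT to be defined in this sketch.
    """
    absolute_risk_reduction = event_rate_standard - event_rate_new
    if absolute_risk_reduction <= 0:
        raise ValueError("new treatment does not reduce the event rate")
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round(1 / absolute_risk_reduction, 9))


# Invented rates: 20% of patients have the event on standard treatment,
# 15% on the new treatment, an absolute risk reduction of 5%.
print(number_needed_to_treat(0.20, 0.15))  # 20
```

With these rates, treating 20 patients with the new treatment would be expected to avoid one event that would otherwise have occurred with the standard treatment.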


pilot study a small study for helping to design a subsequent trial. See also feasibility study.
external pilot study a form of pilot study in which planning assumptions with respect to, for example, the anticipated standard deviation or a response rate are investigated to assist in the sample size calculation for the intended trial. These data do not form part of the data for the eventual trial which is conducted.
internal pilot study a form of pilot study in which the collected data also form part of the data for the main trial. When designing a trial, the sample size determined may hinge on critical assumptions such as the likely response rate in the standard or control arm or, if a continuous outcome measure is under consideration, the likely standard deviation. In such cases, an internal pilot study may be conducted, which comprises a predetermined number of the first patients recruited to the trial. Information from these patients is then used to calculate, in our examples, either the response rate or standard deviation. If these are far from those used at the planning stage of the trial, they are used to recalculate trial size. However, the trial size is then only amended if this new size exceeds the original target.
placebo an inert substance usually prepared to look, smell, taste, etc. as similar to the active product being investigated in a trial as possible.
PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses, available at http://prisma-statement.org/.
protocol a written document describing all the important details of how a trial will be conducted. It will generally include details of the interventions being used, a rationale for the trial, what procedures will be carried out on the subjects in the trial, how many subjects will be recruited, the design of the trial and how the data will be analysed.
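The internal pilot study entry above describes recalculating the trial size from early data and amending it only if the new size exceeds the original target. The sketch below illustrates that rule for a continuous outcome. It assumes the standard normal-approximation formula for comparing two means with equal allocation, which is not given in the entry; the function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Subjects per group to detect a mean difference delta (two-sided test).

    Uses the normal approximation n = 2 (z_{1-alpha/2} + z_{power})^2 sd^2 / delta^2,
    an assumption made for this sketch.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


def revised_n_per_group(planned_sd, pilot_sd, delta, alpha=0.05, power=0.80):
    """Recalculate with the internal pilot SD; never go below the original target."""
    original = n_per_group(planned_sd, delta, alpha, power)
    recalculated = n_per_group(pilot_sd, delta, alpha, power)
    return max(original, recalculated)


# Planned with SD 10 to detect a difference of 5 (63 per group).
print(revised_n_per_group(planned_sd=10, pilot_sd=12, delta=5))  # 91: size increased
print(revised_n_per_group(planned_sd=10, pilot_sd=8, delta=5))   # 63: original target kept
```

The judgement of whether the pilot estimate is "far from" the planning value is left out of the sketch; here the size is simply recalculated and compared with the original target.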
protocol violation something that happens during the trial (usually to one or more of the recruited subjects) that does not fully conform to what was described in the protocol.
randomisation the process of allocating subjects to interventions in a manner that is neither systematic nor predictable.
randomisation list a list, produced by a random process, that tells which subjects will receive which intervention in a randomised trial.
dynamic allocation a randomisation method that changes the probability of assignment to different treatment groups as a trial progresses.
minimisation a pseudo random method of assigning treatments to subjects to try to balance the distribution of covariates across the treatment groups.
Zelen’s randomised consent design a design which combines randomisation with consent. Subjects are randomised to one of two treatment groups. Those who are randomised to the standard treatment are all treated with it (no consent to take part in the trial is sought). Those who are randomised to the experimental treatment are asked for their consent; if they agree they are treated with the experimental treatment; if they disagree they are treated with the standard treatment. The analysis must be based on the treatment to which patients were randomised, not the treatment they actually receive.
range check an edit check to identify any data values that fall outside a specified lower limit or upper limit.
repeated measurements design a trial in which subjects have several measurements of the same variable taken at different times.
seed a number used to initialise a computerised pseudo random number generator which is used to create the randomisation code to decide which subjects receive which intervention. A pseudo random number generator provides numbers that appear as if they are completely random but which (technically) are not. Current regulatory requirements specify that the randomisation list must be reproducible at any time, and knowledge of the seed ensures that this can be done. Note that tossing a coin is random, but the sequence so generated is not reproducible and so its veracity cannot be confirmed.
selection bias the bias caused by the fact that the types of subjects who take part in trials are not a random sample of the population from which they are drawn. Specifically, if the random treatment allocation process is not adhered to in that the allocation is known before eligibility is determined (implying that the patients may be selected for the treatment rather than the treatment selected for the patient) then this results in selection bias which may compromise the final evaluation of the alternatives.
sequential design a general type of trial design, in which subjects are recruited and the accumulating data analysed after every subject has completed the trial. The analysis does not wait until a fixed number of subjects have completed the trial. The trial continues to recruit until a positive or negative result becomes evident.
closed sequential design a sequential trial design where an upper limit on the number of subjects exists (hence ‘closed’) but it is possible to draw conclusions and stop the trial before that number of subjects have been recruited.
group sequential design a form of sequential design where interim analyses are carried out after a number of subjects have been recruited into a trial. Usually only two or three analyses would be planned in such a trial after either half the subjects or one-third and two-thirds of the subjects have completed the trial.
open sequential design a sequential trial design that does not have any upper limit to the number of subjects that may be recruited.
serious adverse event (SAE) a regulatory term with a strict meaning. It includes all adverse events that result in death, are life threatening, require inpatient hospitalisation or prolongation of existing hospitalisation, or result in disability or congenital abnormality. Note that some of these events could be nonserious (and quite routine) in a medical sense.
sham an alternative term for a placebo but used particularly when the active form of the treatment is not a conventional tablet, capsule, etc. Examples of sham treatments have included sham acupuncture and even sham surgery.
split-mouth design a trial where each subject receives each trial medication: Treatment A is applied to one side of the oral cavity and Treatment B to the other.
carryacross a term used in oral cavity trials where, for example, one side of the mouth receives one drug whose effect may modify (contaminate) the effect of a second drug applied to the other side (and vice versa).
stepped wedge design a cluster trial in which assessments are repeated over periods of time. The successive periods characterise the cells of the corresponding cluster. In many instances, all clusters (the first cell) initially receive the control; in the next period some continue with control, while others are randomised to the intervention of concern. Once that period is complete, those clusters allocated to the intervention continue with the intervention in the next cell, while a proportion of those currently receiving control are randomised to intervention for the next cell while the remainder continue with control. This process continues over time until the final period when all clusters (in their final cell) receive the intervention.
closed cohort design the same group of subjects are recruited within each cell of every cluster within the stepped wedge design.
open cohort or cross-sectional design a different group of subjects are recruited within each cell of every cluster within the stepped wedge design.
stopping rule the rule for deciding when to stop recruitment to a trial. This may be based on formal statistical considerations or be more informal.
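The seed and randomisation list entries above stress that a randomisation list must be reproducible from its seed. The following is a minimal sketch of that idea using Python's standard pseudo random number generator. The use of permuted blocks of four, the arm labels and the function name are assumptions made for illustration, not a method prescribed in the text.

```python
import random


def randomisation_list(n_subjects, seed, arms=("A", "B"), block_size=4):
    """Build a reproducible allocation list from a seed.

    Uses permuted blocks (an assumption for this sketch): each block
    contains every arm equally often, in a pseudo random order.
    block_size must be a multiple of the number of arms.
    """
    rng = random.Random(seed)  # generator initialised by the seed
    allocations = []
    while len(allocations) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_subjects]


first = randomisation_list(12, seed=20210401)
again = randomisation_list(12, seed=20210401)
print(first == again)  # True: the same seed regenerates the same list
```

Keeping a record of the seed (and of the software version used) is what allows the list to be regenerated and verified later, which a sequence of coin tosses cannot offer.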


early stopping the practice of stopping recruitment into a trial before reaching the target sample size or of stopping follow-up of a trial before the intended final duration. This may be in a sequential trial, after a formal interim analysis or for purely practical reasons that are independent of efficacy or safety concerns.
stratify to divide those entering a clinical trial into groups according to the values of a categorical variable.
stratified randomisation the use of separate randomisation lists for different strata in the sample. This is often done to ensure that possible important prognostic factors are balanced across the intervention groups.
superiority trial a trial where the objective is to show that one intervention is better than another.
systematic review a thorough and complete review and assessment of all the published and unpublished literature and information available.
treatment difference or effect the difference between two treatment groups based on one of the trial endpoints. This might commonly be the difference in mean response, difference in proportions of responders, or difference in median survival times. It may be expressed through the use of an odds ratio or hazard ratio as appropriate.
clinically significant difference a treatment effect that is sufficiently large to be useful for treating patients.
variable the mathematical term for a characteristic or property of something or someone that is being measured. It may vary from time to time and from subject to subject.
dependent variable in a statistical model, the dependent variable is the one we are trying to predict from the independent variable(s).
design variable any variable that contributes to the design of a trial, often because of stratification according to values of the variable.
independent variable, explanatory variable or covariate in a regression model, the dependent variable may depend on the independent variables, the most important of which is the assigned treatment group. Other important independent variables are those thought prognostic for outcome. Note, confusingly, that several so-called independent variables may not be independent of each other.
prognostic factor a factor (often a feature of the patient at diagnosis rather than the treatment received) that is predictive of outcome.

References [Chapter numbers in square brackets]
Abdullah K, Thorpe KE, Mamak E, Maguire JL, Birken CS, Fehlings D, Hanley AJ, Macarthur C, Zlotkin SH and Parkin PC (2015). An internal pilot study for a randomized trial aimed at evaluating the effectiveness of iron interventions in children with non-anemic iron deficiency: the OptEC trial. Trials, 16, 303. [19]
Adams G, Gulliford MC, Ukoumunne OC, Eldridge S, Chinn S and Campbell MJ (2004). Patterns of intra-cluster correlation from primary care research to inform study design and analysis. Journal of Clinical Epidemiology, 57, 785–794. [16]
Advanced Ovarian Trialist Group (1991). Chemotherapy in advanced ovarian cancer: an overview of randomised clinical trials. BMJ, 303, 884–893. [20]
Allan L, Hays H, Jensen N-H, de Waroux BLP, Bolt M, Donald R and Kalso E (2008). Randomised crossover trial of transdermal fentanyl and sustained release oral morphine for treating chronic non-cancer pain. BMJ, 322, 1–7. [15]
Al-Sarraf M, Pajak PF, Cooper JS, Mohiuddin M, Herskovic A and Ager PJ (1990). Chemoradiotherapy in patients with locally advanced nasopharyngeal carcinoma: a radiation therapy oncology group study. Journal of Clinical Oncology, 8, 1342–1351. [2]
Altman DG (2000). Clinical trials and meta-analyses. In Altman DG, Machin D, Bryant TN and Gardner MJ (eds). Statistics with Confidence. (2nd edn). BMJ Books, London, 124–127. [8]
Altman DG (2018). Avoiding bias in trials in which allocation ratio is varied. Journal of the Royal Society of Medicine, 111, 143–144. [11]
Altman DG, Whitehead J, Parmar MKB, Stenning SP, Fayers PM and Machin D (1995). Randomised consent designs in cancer clinical trials. European Journal of Cancer, 31A, 1934–1944. [20]
Amaro CM, Bello JA, Jain D, Ramnath A, D’Ugard C, Vanbuskirk S, Bancalari E and Claure N (2018). Early caffeine and weaning from mechanical ventilation in preterm infants: a randomized, placebo-controlled trial. Journal of Pediatrics, 196, 52–57. [10]
Ang ES-W, Lee S-T, Gan CS-G, See PG-J, Chan Y-H, Ng L-H and Machin D (2001). Evaluating the role of alternative therapy in burn wound management: randomized trial comparing Moist Exposed Burn Ointment with conventional methods in the management of patients with second-degree burns. Medscape General Medicine, 3, 3. [2, 14]
Ang ES-W, Lee S-T, Gan CS-G, Chan Y-H, Cheung Y-B and Machin D (2003). Pain control in a randomized controlled trial comparing Moist Exposed Burn Ointment (MEBO) and

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.


conventional methods in patients with partial thickness burns. Journal of Burn Care and Rehabilitation, 24, 289–296. [4]
Arrow P and Riordan PJ (1995). Retention and caries preventive effects of a GIC and a resin-based fissure sealant. Community Dentistry and Oral Epidemiology, 23, 282–285. [13]
Assmann SF, Pocock SJ, Enos LE and Kasten LE (2000). Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet, 355, 1064–1069. [8]
Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D and Stroup DF (1996). Improving the quality of reporting randomized controlled trials: the CONSORT statement. Journal of the American Medical Association, 276, 637–639. [11]
Bellary S, O’Hare JP, Raymond NT, Gumber A, Mughal S, Szczepura A, Kumar S and Barnett AH (2008). Enhanced diabetes care to patients of south Asian ethnic origin (the United Kingdom Asian Diabetes Study): a cluster randomised trial. Lancet, 371, 1769–1776. [16]
Berkowitz ST (1998). Prerandomization of clinical trials: a more ethical way of performing cleft palate research. Plastic and Reconstructive Surgery, 102, 1724. [20]
Besselink MGH, van der Graaf Y and Gooszen HG (2010). Interim analysis in randomized trials: DAMOCLES’ sword? Journal of Clinical Epidemiology, 63, 353–354. [10]
Birkett MA and Day SJ (1994). Internal pilot studies for estimating sample size. Statistics in Medicine, 13, 2455–2463. [10, 19]
Bladé J, Samson D, Reece D, Apperley J, Björkstrand B, Gahrton G, Gertz M, Giralt S, Jagannath S and Vesole D (1998). Criteria for evaluating disease response and progression in patients with multiple myeloma treated by high-dose therapy and haemopoietic stem cell transplantation. Myeloma Subcommittee of the EBMT. European Group for Blood and Marrow Transplant. British Journal of Haematology, 102, 1115–1123. [4]
Boutron I, Moher D, Altman DG, Schulz KF and Ravaud P (2008). Extending the CONSORT statement to randomized trials of nonpharmacologic treatment: explanation and elaboration. Annals of Internal Medicine, 148, 295–309. [11]
Bowman SJ, Everett CC, O’Dwyer JL, Emery P, Pitzalis C, Ng W-F, Pease CT, Price EJ, Sutcliffe N, Gendi NST, Hall FC, Ruddock SP, Fernadez C, Reynolds C, Hulme CT, Davies KA, Edwards CJ, Lanyon PC, Moots RJ, Roussou E, Giles IP, Sharples LD and Bombardieri M (2017). Randomized controlled trial of rituximab and cost-effectiveness analysis in treating fatigue and oral dryness in primary Sjögren’s syndrome. Arthritis & Rheumatology, 69, 1440–1450. [14]
Bradburn MJ, Clark TG, Love SB and Altman DG (2003). Survival analysis part III: multivariate data analysis – choosing a model and assessing its adequacy and fit. British Journal of Cancer, 89, 605–611. [11]
Brandes AA, Vastola F, Basso U, Berti F, Inna G, Rotilio A, Gardiman M, Scienza R, Monfardini S and Ermani M (2003). A prospective study of glioblastoma in the elderly. Cancer, 97, 657–662. [1]
Broderick JP, Palesch YY, Demchuk AM, Yeatts SD, Khatri P, Hill MD, Jauch EC, Jovin TG, Yan B, Silver FL, von Kummer R, Molina CA, Demaerschalk BM, Budzik R, Clark WM, Zaidat OO, Malisch TW, Goyal M, Schonewille WJ, Mazighi M, Engelter ST, Anderson C, Spilker J, Carrozzella J, Ryckborst KJ, Janis LS, Martin RH, Foster LD and Tomsick TA (2013). Endovascular therapy after intravenous t-PA versus t-PA alone for stroke. The New England Journal of Medicine, 368, 893–903. [10]
Brown J, McElvenny D, Nixon J, Bainbridge J and Mason S (2000). Some practical issues in the design, monitoring and analysis of a sequential randomized trial in pressure sore prevention. Statistics in Medicine, 19, 3389–3400. [4, 20]


Browne RH (1995). On the use of a pilot study for sample size determination. Statistics in Medicine, 14, 1933–1940. [19]
Bryant TN (2000). Computer software for calculating confidence intervals (CIA). In Altman DG, Machin D, Bryant TN and Gardner MJ (eds). Statistics with Confidence. (2nd edn). British Medical Journal, London, 208–213. [11]
Burgess IF, Brown CM and Lee PN (2005). Treatment of head louse infestation with 4% dimeticone lotion: randomised controlled equivalence trial. BMJ, 330, 1423. [15]
Calverley P, Pauwels R, Vestbo J, Jones P, Pride N, Gulsvik A, Anderson J and Maden C (2003). Combined salmeterol and fluticasone in the treatment of chronic obstructive pulmonary disease: a randomised controlled trial. Lancet, 361, 449–456. [12]
Campbell G, Pickles T and D’yachkova Y (2003). A randomised trial of cranberry versus apple juice in the management of urinary symptoms during external beam radiation therapy for prostate cancer. Clinical Oncology, 15, 322–328. [5]
Campbell M, Fitzpatrick R, Haines A, Kinmonth AL, Sandercock P, Spiegelhalter D and Tyrer P (2000). Framework for design and evaluation of complex interventions to improve health. BMJ, 321, 694–696. [1]
Campbell MJ and Gardner MJ (2000). Medians and their differences. In Altman DG, Machin D, Bryant TN and Gardner MJ (eds). Statistics with Confidence. (2nd edn). British Medical Journal, London, 36–56. [8]
Campbell MJ and Walters SJ (2014). How to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. Wiley, Chichester. [1, 16]
Campbell MK, Fayers PM and Grimshaw JM (2005). Determinants of the intracluster correlation coefficient in cluster randomized trials: the case of implementation research. Clinical Trials, 2, 99–10. [16]
Campbell MK, Piaggio G, Elbourne DR and Altman DG (2012). CONSORT 2010 statement: extension to cluster randomised trials. BMJ, 345, e5661. [11, 16]
Chan A-W, Tetzlaff JM, Gøtzsche PC, Altman DG, Mann H, Berlin JA, Dickersin K, Hróbjartsson A, Schulz KF, Parulekar WR, Krleža-Jeric K, Laupacis A and Moher D (2013). SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials. BMJ, 346, e7586. [3]
Chen ZM, Pan HC, Chen YP, Peto R, Collins R, Jiang LX, Xie JX and Liu LS (2005). Early intravenous then oral metoprolol in 45,852 patients with acute myocardial infarction: randomised placebo-controlled trial. Lancet, 366, 1622–1632. [20]
Chow PK-H, Tai B-C, Tan C-K, Machin D, Khin M-W, Johnson PJ and Soo K-C (2002). High-dose tamoxifen in the treatment of inoperable hepatocellular carcinoma: a multicenter randomized controlled trial. Hepatology, 36, 1221–1226. [2, 3, 5, 11, 12]
Chung AYF, Ooi LLPJ, Machin D, Tan SB, Goh BKP, Wong JS, Chen YM, Li PCN, Gandhi M, Thng CH, Yu SWK, Tan BS, Lo RHG, Htoo AMM, Tay KH, Sundram FX, Goh ASW, Chew SP, Liau KH, Chow PKH, Tay KH, Tan YM, Cheow PC, Ho CK and Soo KC (2013). Adjuvant hepatic intra-arterial iodine-131-lipiodol following curative resection of hepatocellular carcinoma: A prospective randomized trial. World Journal of Surgery, 37, 1356–1361. [10, 20]
Clarke M, Hopewell S and Chalmers I (2007). Reports of clinical trials should begin and end with up-to-date systematic reviews of the other relevant evidence: a status report. Journal of the Royal Society of Medicine, 100, 187–190. [20]
Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences. (2nd edn). Lawrence Erlbaum, New Jersey. [9, 11, 12, 13, 16, 19]


Collados-Gómez L, Ferrara-Camacho P, Fernandez-Serrano E, Camacho-Vincete V, Flores-Herrero C, García-Pozo AM and Jiménez-García R (2018). Randomised crossover trial showed that using breast milk or sucrose provided the same analgesic effect in preterm infants of at least 28 weeks. Acta Paediatrica, 107, 436–441. [13, 15]
Comi G, Pulizzi A, Rovaris M, Abramsky O, Arbizu T, Boiko A, Gold R, Havrdova E, Komoly S, Selmaj KW, Sharrack B and Filippi M (2008). Effect of laquinimod on MRI-monitored disease activity in patients with relapsing-remitting multiple sclerosis: a multicentre, randomised, double-blind, placebo-controlled phase IIb study. Lancet, 371, 2085–2092. [4, 11]
Copas AJ, Lewis JJ, Thompson JA, Davey C, Baio G and Hargreaves JR (2015). Designing a stepped wedge trial: three main designs, carry-over effects and randomisation procedures. Trials, 16, 352. [17]
Cornelis J, Barakat A, Dekker J, Schut T, Berk S, Nusselder H, Ruhl N, Zoeteman J, Van R, Beekman A and Blankers M (2018). Intensive home treatment for patients in acute psychiatric crisis situations: a multicentre randomized controlled trial. BMC Psychiatry, 18, 55. [20]
Cox DR (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society B, 34, 187–220. [1]
Craig P, Dieppe P, Mcintyre S, Michie S, Nazareth I and Petticrew M (2008). Developing and evaluating complex interventions: the new Medical Research Council guidance. BMJ, 337, a1655. [17]
Csendes A, Burdiles P, Korn O, Braghetto I, Huertas C and Rojas J (2002). Late results of a randomized clinical trial comparing total fundoplication versus calibration of the cardia with posterior gastropexy. British Journal of Surgery, 87, 289–297. [5]
Cuschieri A, Weeden S, Fielding J, Bancewicz J, Craven J, Joypaul V, Sydes M and Fayers P (1999). Patient survival after D1 and D2 resections for gastric cancer: long-term results of the MRC randomized surgical trial. British Journal of Cancer, 79, 1522–1530. [9]
DAMOCLES Study Group (2005). A proposed charter for clinical trial data monitoring committees: helping them to do their job well. Lancet, 365, 711–722. [10, 20]
Daniels LA, Mallan KM, Nicholson JM, Battistutta D and Magarey A (2013). Outcomes of an early feeding practices intervention to prevent childhood obesity. Pediatrics, 132, e109. [4]
Day S (2007). Dictionary for Clinical Trials. (2nd edn). Wiley, Chichester. [1, 3, 6, 17, 19]
Day S and Småstuen M (2008). An open mind on adaptive designs. GCPj, Informa UK Ltd, 19–22, December. http://www.GCPj.com. [20]
De Angelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S, Laine C, Marusic A, Overbeke AJPM, Schroeder TV, Sox HC and Van Der Weyden MB (2004). Clinical trial registration: a statement from the International Committee of Medical Journal Editors. Annals of Internal Medicine, 141, 477–478. [6]
Del Fabbro E, Dev R, Hui D, Palmer L and Bruera E (2013). Effects of melatonin on appetite and other symptoms in patients with advanced cancer and cachexia: a double-blind placebo-controlled trial. Journal of Clinical Oncology, 31, 1271–1276. [10]
DeMets DL (2006). Futility approaches to interim monitoring by data monitoring committees. Clinical Trials, 3, 522–529. [10]
D’Haens G, Baert F, van Assche G, Caenepeel P, Vergauwe P, Tuynman H, De Vos M, van Deventer S, Stitt L, Donner A, Vermeire S, Van De Mierop FJ, Coche J-CR, van der Woude J, Ochsenkühn T, van Bodegraven AA, Van Hootengem PP, Lambrecht GL, Mana F, Rutgeerts P, Feagan BG and Hommes D (2008). Early combined immunosuppression or conventional management in patients with newly diagnosed Crohn’s disease: an open randomised trial. Lancet, 371, 660–667. [11]


Dickersin K and Rennie D (2003). Registering clinical trials. Journal of the American Medical Association, 290, 516–523. [6, 11] Diggle PJ, Heagerty P, Liang KY and Zeger SL (2002). Analysis of Longitudinal Data. (2nd edn). Oxford University Press. [14] Dmitrienko A, Molenberghs G, Chuang-Stein C and Offen W (2005). Analysis of Clinical Trials using SAS®: A Practical Guide. SAS Institute Inc., Cary, NC. [20] Dowding D, Lichtner V and Closs SJ (2017). Using the MRC framework for complex interventions to develop clinical decision support: a case study. Informatics for Health: Connected Citizen-Led Wellness and Population Health. doi:10.3233/978-1-61499-753-5-544. [17] Drake-Brockman TFE, Ramgolam A, Zhang G, Hall GL and von Ungern-Sternberg BS (2017). The effect of endotracheal tubes versus laryngeal mask airways on perioperative respiratory adverse events in infants: a randomised controlled trial. Lancet, 389, 701–708. [10] Dresser GK, Nelson SA, Mahon JL, Zou G, Vandervoort MK, Wong CJ, Feagan BG and Feldman RD (2013). Simplified therapeutic intervention to control hypertension and hypercholesterolemia: a cluster randomized controlled trial (STITCH2). Journal of Hypertension, 31, 1702–1713. [16] Drucker DJ, Buse JB, Taylor K, Kendall DM, Trautmann M, Zhuang D and Porter L for the DURATION-1 Study Group (2008). Exenatide once weekly versus twice daily for the treatment of type 2 diabetes: a randomised, open-label, non-inferiority study. Lancet, 372, 1240–1250. [11] Dupont WD and Plummer WD (2018). PS: Power and Sample Size. v3.1.6 [email protected]. [9] Dwan K, Li T, Altman DG and Elbourne D (2019). CONSORT 2010 statement: extension to randomised crossover trials. BMJ, 366, l4378. [13] Dwyer HC, Baranowski DC, Mayer PV and Gabriele S (2018). LivRelief varicose veins cream in the treatment of chronic venous insufficiency of the lower limbs: a 6-week single arm pilot study. PLoS One, 13, e0208954. [19] Egger M, Davey-Smith G and Higgins J (2021). 
Systematic Reviews in Health Care: Meta-Analysis in Context. (3rd edn). BMJ Books, London. [20] Elbourne DR, Altman DG, Higgins JPT, Curtin F, Worthington HV and Vail A (2002). Meta-analysis involving cross-over trials: methodological issues. International Journal of Epidemiology, 31, 140–149. [13] Eldridge SM, Chan CL, Campbell MJ, Bond CM, Hopewell S, Thabane L and Lancaster GA on behalf of the PAFS Consensus Group (2016). CONSORT 2010 statement: extension to randomised pilot and feasibility trials. BMJ, 355, i5239. [19] Eldridge SM, Lancaster GA, Campbell MJ, Thabane L, Hopewell S, Coleman CL and Bond CM (2016). Defining feasibility and pilot studies in preparation for randomised controlled trials: development of a conceptual framework. PLoS One, 11, e0150205. [19] EMEA (2005). Guideline on Clinical Trials in Small Populations. CHMP/EWP/83561/2005. Equi A, Balfour-Lynn IM, Bush A and Rosenthal M (2002). Long term azithromycin in children with cystic fibrosis: a randomised, placebo-controlled crossover trial. Lancet, 360, 978–984. [13] Erbel R, Di Mario C, Bartunek J, Bonnier J, de Bruyne B, Eberli FR, Erne P, Haude M, Heublein B, Horrigan M, Ilsley C, Böse D, Koolen J, Lüscher TF, Weissman N and Waksman R (2007). Temporary scaffolding of coronary arteries with bioabsorbable magnesium stents: a prospective, non-randomised multicentre trial. Lancet, 369, 1869–1875. [1, 2, 9, 11]


Fayers PM and King M (2008). The baseline characteristics did not differ significantly. Quality of Life Research, 17, 1047–1048. [11] Fayers PM and Machin D (2016). Quality of Life: The Assessment, Analysis and Interpretation of Patient-reported Outcomes. (3rd edn). Wiley, Chichester. [1, 4, 14, 20] Fayers PM, Jordhøy MS and Kaasa S (2002). Cluster-randomized trials. Palliative Medicine, 16, 69–70. [16] Filippatos GS, de Graeff P, Bax JJ, Borg J-J, Cleland JGF, Dargie HJ, Flather M, Ford I, Friede T, Greenberg B, Henon-Goburdhun C, Holcomb R, Horst B, Lekakis J, Mueller-Velten G, Papavassiliou AG, Prasad K, Rosano GMC, Severin T, Sherman W, Stough WG, Swedberg K, Tavazzi L, Tousoulis D, Vardas P, Ruschitzka F and Anker SD (2017). Independent academic data monitoring committees for clinical trials in cardiovascular and cardiometabolic diseases. European Journal of Heart Failure, 19, 449–456. [10] Fitzmaurice GM, Laird NM and Ware JH (2011). Applied Longitudinal Analysis. (2nd edn). Wiley, Hoboken. [14] Fitzpatrick S (2008a). Clinical Trial Design. ICR Publishing, Marlow. [1] Fitzpatrick S (2008b). The Clinical Trial Protocol. ICR Publishing, Marlow. [1] Fleiss JL (1986). The Design and Analysis of Clinical Experiments. Wiley, New York. [12] Flight L and Julious SA (2016). Practical guide to sample size calculations: non-inferiority and equivalence trials. Pharmaceutical Statistics, 15, 80–89. [15] Frech SA, DuPont HL, Bourgeois AL, McKenzie R, Belkind-Gerson J, Figueroa JF, Okhuysen PC, Guerrero NH, Martinez-Sandoval FG, Meléndez-Romero JH, Jiang ZD, Asturias EJ, Halpern J, Torres OR, Hoffman AS, Villar CP, Kassem RN, Flyer DC, Anderson BH, Kazempour K, Breisch SA and Glenn GM (2008). Use of patch containing heat-labile toxin from Escherichia coli against travellers’ diarrhoea: a phase II, randomised, double-blind, placebo-controlled field trial. Lancet, 371, 2019–2025. [11] Freeman JV, Walters SJ and Campbell MJ (2008). How to Display Data. BMJ Books, Oxford. 
[1, 8] Frickman H, Schwarz NG, Hahn A, Ludyga A, Warnke P and Podbielski A (2018). Comparing a single-day swabbing regimen with an established 3-day protocol for MRSA decolonization control. Clinical Microbiology and Infection, 24, 522–527. [15] Frison L and Pocock SJ (1992). Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Statistics in Medicine, 11, 1685–1704. [11, 14] Frison L and Pocock SJ (1997). Linearly divergent treatment effects in clinical trials with repeated measures: efficient analysis using summary statistics. Statistics in Medicine, 16, 2855–2872. [14] Gagnier JJ, Boon H, Rochon P, Moher D, Barnes J and Bombardier C (2006). Reporting randomized, controlled trials of herbal interventions: an elaborated CONSORT statement. Annals of Internal Medicine, 144, 364–367. [11] Gandhi M, Tan S-B, Chung AYF and Machin D (2015). On developing a pragmatic strategy for clinical trials: a case study of hepatocellular carcinoma. Contemporary Clinical Trials, 43, 252–259. [10, 20] Gao F, Earnest A, Matchar DB, Campbell MJ and Machin D (2015). Sample size calculations for the design of cluster randomized trials: a review. Contemporary Clinical Trials, 42, 41–50. [16] Garassino MC, Martelli O, Broggini M, Farina G, Veronese S, Rulli E, Bianchi F, Bettini A, Longo F, Moscetti L, Tomirotti M, Marabese M, Ganzinelli M, Lauricella C, Labianca R, Floriani I,


Giaccone G, Torri V, Scanni A and Marsoni S (2013). Erlotinib versus docetaxel as second-line treatment of patients with advanced non-small-cell lung cancer and wild-type EGFR tumours (Tailor): a randomised controlled trial. Lancet Oncology, 14, 981–988. [18] Gerb J and Köpcke W (2007). The new EMEA-CHMP guideline on clinical trials in small populations – methodological and statistical considerations with published examples. Biocybernetics and Biomedical Engineering, 27, 59–66. [20] Giles FJ, Kantarjian HM, Cortes JE, Garcia-Manero G, Verstovsek S, Faderl S, Thomas DA, Ferrajoli A, O’Brien S, Wathen JK, Xiao L-C, Berry DA and Estey EH (2003). Adaptive randomized study of Idarubicin and Cytarabine versus Troxacitabine and Cytarabine versus Troxacitabine and Idarubicin in untreated patients 50 years or older with adverse karyotype acute myeloid leukemia. Journal of Clinical Oncology, 21, 1722–1727. [20] Girard TD, Kress JP, Fuchs BD, Thomason JW, Schweickert WD, Pun BT, Taichman DB, Dunn JG, Pohlman AS, Kinniry PA, Jackson JC, Canonico AE, Light RW, Shintani AK, Thompson JL, Gordon SM, Hall JB, Ditus RS, Bernard GR and Ely EW (2008). Efficacy and safety of a paired sedation and ventilator weaning protocol for mechanically ventilated patients in intensive care (Awakening and Breathing Controlled trial): a randomised controlled trial. Lancet, 371, 126–134. [11] Girling DJ, Parmar MKB, Stenning SP, Stephens RJ and Stewart LA (2003). Clinical Trials in Cancer: Principles and Practice. Oxford University Press, Oxford. [1] Glaucoma Laser Trial Research Group (1995). The Glaucoma Laser Trial (GLT): 6. Treatment group differences in visual field changes. American Journal of Ophthalmology, 120, 10–22. [1] Grant S, Mayo-Wilson E, Montgomery P, MacDonald G, Michie S, Hopewell S and Moher D (2018). CONSORT-SPI 2018 Explanation and Elaboration: guidance for reporting social and psychological intervention trials. Trials, 19, 406. 
[11, 17] Grobler L, Siegfried N, Askie L, Hooft L, Tharyan P and Antes G (2008). National and multinational prospective trial registers. Lancet, 372, 1201–1202. [6] Haesebaert J, Nighoghossian N, Mercier C, Termoz A, Porthault S, Derex L, Gueugniaud P-Y, Bravant E, Rabilloud M and Schott A-M (2018). Improving access to thrombolysis and in-hospital management times in ischemic stroke. A stepped-wedge randomized trial. Stroke, 49, 405–411. [17] Hancock MJ, Maher CG, Latimer J, McLachlan AJ, Cooper CW, Day RO, Spindler MF and McAuley JH (2007). Assessment of diclofenac or spinal manipulative therapy, or both, in addition to recommended first-line treatment for acute low back pain: a randomised controlled trial. Lancet, 370, 1638–1643. [11, 12] Haussen DC, Doppelheuer S, Schindler K, Grossberg JA, Bouslama M, Schultz M, Perez H, Hall A, Frankel M and Nogueira RG (2017). Utilization of a smartphone platform for electronic informed consent in acute stroke trials. Stroke, 48, 3156–3160. [3] Hay AD, Costelloe C, Redmond NM, Montgomery AA, Fletcher M, Hollinghurst S and Peters TJ (2008). Paracetamol plus ibuprofen for the treatment of fever in children (PITCH): randomised controlled trial. BMJ, 337, a1302. Erratum BMJ, 339, b3295. [2, 11] Haybittle J (1971). Repeated assessment of results in clinical trials of cancer treatment. British Journal of Radiology, 44, 793–797. [10] Hayes R (2018). Randomising towns to fight HIV. Significance, 15, 28–31. [16] He Y, Tan EH, Wong A, Tan CC, Wong P, Lee SC and Tai BC (2018). Improving medication adherence with adjuvant aromatase inhibitor in women with breast cancer: study protocol of a


randomised controlled trial to evaluate the effect of short message service (SMS) reminder. BMC Cancer, 18, 727. [3] Hemming K, Taljaard M, McKenzie JE, Hooper JE, Copas A, Thompson JA, Dixon-Woods M, Aldcroft A, Doussau A, Grayling M, Kristunas C, Goldstein CE, Campbell MK, Girling A, Eldridge S, Campbell MJ, Lilford RJ, Weijer C, Forbes AB and Grimshaw JM (2018). Reporting of stepped wedge cluster randomised trials: extension of the CONSORT 2010 Statement with explanation and elaboration. BMJ, 363, k1614. [17] Hemming K, Taljaard M and Grimshaw J (2019). Introducing the new CONSORT extension for stepped-wedge cluster randomised trials. Trials, 20, 68. [17] Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ and Welch VA (eds) (2019). Cochrane Handbook for Systematic Reviews of Interventions (Version 6). Wiley, Chichester. [1, 20] Hofman MS, Lawrentschuk N, Francis RJ, Tang C, Vela I, Thomas P, Rutherford N, Martin JM, Frydenberg M, Shakher R, Wong L-M, Taubman K, Lee ST, Hsiao E, Roach P, Nottage M, Kirkwood I, Hayne D, Link E, Marusic P, Matera A, Herschtal A, Iravani A, Hicks RJ, Williams S and Murphy DG (2020). Prostate-specific membrane antigen PET-CT in patients with high-risk prostate cancer before curative-intent surgery or radiotherapy (proPSMA): a prospective, randomised, multi-center study. Lancet, 395, 1208–1216. [9] Hooper R, Teerenstra S, de Hoop E and Eldridge S (2016). Sample size calculation for stepped wedge and other longitudinal cluster randomized trials. Statistics in Medicine, 35, 4718–4728. [17] Hopewell S, Clarke M, Moher D, Wager E, Middleton P and Altman DG (2008). CONSORT for reporting randomised trials in journal and conference abstracts. Lancet, 371, 281–283. [11] Huibers MJ, Bleijenberg G, Beurskens AJ, Kant IJ, Knottnerus JA, van der Windt DA, Bazelmans E and van Schayck CP (2004). An alternative trial design to overcome validity and recruitment problems in primary care research. Family Practice, 21, 213–218. 
[20] Hujoel PP (1998). Design and analysis issues in split mouth clinical trials. Community Dentistry and Oral Epidemiology, 26, 85–86. [13] Hujoel PP and Loesche WJ (1990). Efficiency of split-mouth designs. Journal of Clinical Periodontology, 17, 722–728. [13] ICH E2A (1995). Clinical Safety Data Management: Definitions and Standards for Expedited Reporting. CPMP/ICH/377/95. [6] ICH E2B (R3) (2013). Electronic Transmission of Individual Case Safety Reports (ICSRs). EMA/CPMP/ICH/287/1995. [6] ICH E3 (1996). Structure and Content of Clinical Study Reports. CPMP/ICH/137/95. [6, 11] ICH E6 (R2) (2016). Guideline for Good Clinical Practice. EMA/CHMP/ICH/135/1995. [1, 3, 4, 5, 6] ICH E8 (R1) (2019). General Considerations for Clinical Studies. EMA/CPMP/ICH/544570/1998. [6] ICH E9 (R1) (2018). Statistical Principles for Clinical Trials. CPMP/ICH/363/96. [1, 2, 3, 4, 5, 6, 9] ICH E10 (2001). Choice of Control Group in Clinical Trials. CPMP/ICH/364/96. [2] Ioannidis JP, Evans SJ, Gøtzsche PC, O’Neill RT, Altman DG, Schultz K and Moher D (2004). Better reporting of harms in randomized trials: an extension of the CONSORT statement. Annals of Internal Medicine, 141, 781–788. [11]


James ND, Sydes MR, Clarke NW, Mason MD, Dearnaley DP, Anderson J, Popert RJ, Sanders K, Morgan RC, Stansfeld J, Dwyer J, Masters J and Parmar MKB (2008). STAMPEDE: systemic therapy for advancing or metastatic prostate cancer – a multi-arm multi-stage randomised controlled trial. Clinical Oncology, 20, 577–581. [7, 20] James ND, Sydes MR, Clarke NW, Mason MD, Dearnaley DP, Spear MR, Ritchie AWS, Parker CC, Russell JM, Attard G, de Bono J, Cross W, Jones RJ, Thalmann G, Amos C, Matheson D, Millman R, Alzouebi M, Beesley S, Birtle AJ, Brock S, Cathomas R, Chakraborti P, Chowdhury S, Cook A, Elliott T, Gale J, Gibbs S, Graham JD, Hetherington J, Hughes R, Laing R, McKinna F, McLaren DB, O’Sullivan JM, Parikh O, Peedell C, Protheroe A, Robinson AJ, Dhrihari N, Srinivasan R, Staffurth J, Sundar S, Tolan S, Tsang D, Wagstaff J and Parmar MKB (2016). Addition of docetaxel, zoledronic acid, or both to first-line long-term hormone therapy in prostate cancer (STAMPEDE): survival results from an adaptive, multiarm, multistage, platform randomised controlled trial. Lancet, 387, 1163–1177. [20] Jenni S, Oetliker C, Allemann S, Ith M, Tappy L, Wuerth S, Egger A, Boesch C, Schneiter Ph, Diem P, Christ E and Stettler C (2008). Fuel metabolism during exercise in euglycaemia and hyperglycaemia in patients with type 1 diabetes mellitus – a prospective single-blind randomised crossover trial. Diabetologia, 51, 1457–1465. [11, 13] Jensen PT, Klee MC, Thranov I and Groenvold M (2004). Validation of a questionnaire for self-assessment of sexual function and vaginal changes after gynaecological cancer. Psycho-Oncology, 13, 577–592. [4] Jiang W, Freidlin B and Simon R (2007). Biomarker adaptive threshold design: a procedure for evaluating treatment with possible biomarker-defined subset effect. Journal of the National Cancer Institute, 99, 1036–1043. [18] Jones B (2008). The cross-over trial: a subtle knife. Significance, 5, 135–137. [13] Jones B, Jarvis P, Lewis JA and Ebbutt AF (1996). 
Trials to assess equivalence: the importance of rigorous methods. British Medical Journal, 313, 36–39. [15] Jordan S, Gabe-Walters ME, Watkins A, Humphreys I, Newson L, Snelgrove S and Dennis MS (2015). Nurse-led medicines’ monitoring for patients with dementia in care homes: a pragmatic cohort stepped wedge cluster randomised trial. PLoS One, 10, e0140203. [17] Jordhøy MS, Fayers PM, Ahlner-Elmquist M and Kaasa S (2002). Lack of concealment may lead to selection bias in cluster randomized trials of palliative care. Palliative Medicine, 16, 43–49. [16] Julious SA (2005). Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics, 4, 287–291. [19] Julious SA, Campbell MJ and Altman DG (1999). Estimating sample sizes for continuous, binary, and ordinal outcomes in paired comparisons: practical hints. Journal of Biopharmaceutical Statistics, 9, 241–251. [13] Julious SA and Owen RJ (2006). Sample size calculations for clinical studies allowing for uncertainty about the variance. Pharmaceutical Statistics, 5, 29–37. [19] Jüni P, Altman DG and Egger M (2001). Assessing the quality of controlled clinical trials. BMJ, 323, 42–46. [20] Kahn RS, Fleischhacker WW, Boter H, Davidson M, Vergouwe Y, Keet IPM, Gheorghe MD, Rybakowski JZ, Galderisi S, Libiger J, Hummer M, Dollfus S, López-Ibor JJ, Hranov LG, Gaebel W, Peuskins J, Lindefors N, Riecher-Rössler A and Grobbee DE (2008). Effectiveness of


antipsychotic drugs in first-episode schizophrenia and schizophreniform disorder: an open randomised clinical trial. Lancet, 371, 1085–1097. [2, 11, 12] Kaplan EL and Meier P (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481. [1] Kerley CP, Dolan E, James PE and Cormican L (2018). Dietary nitrate lowers ambulatory blood pressure in treated, uncontrolled hypertension: a 7-d, double-blind, randomised, placebo-controlled, crossover trial. British Journal of Nutrition, 119, 658–663. [1, 13] Kieser M and Wassmer G (1996). On the use of the upper confidence limit for the variance from a pilot sample for sample size determination. Biometrical Journal, 8, 941–949. [19] Kim ES, Herbst RS, Wistuba II, Lee JJ, Blumenschein GR, Tsao A, Stewart DJ, Hicks ME, Erasmus J, Gupta S, Alden CM, Liu S, Tang X, Khuri FR, Tran HT, Johnson BE, Heymach JV, Mao L, Fossella F, Kies MS, Papadimitrakopoulou V, Davis SE, Lippman SM and Hong WK (2011). The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery, 1, 44–53. [18] Kim G, Chen E, Tay AYL, Lee JS, Phua JNS, Shabbir A, Sol JBY and Tai BC (2017). Extensive peritoneal lavage after curative gastrectomy for gastric cancer (EXPEL): study protocol of an international multicentre randomised controlled trial. Japanese Journal of Clinical Oncology, 47, 179–184. [3, 10, 20] King A (2009). Once-daily insulin detemir is comparable to once-daily insulin glargine in providing glycaemic control over 24 h in patients with type 2 diabetes: a double-blind, randomized, crossover study. Diabetes, Obesity and Metabolism, 11, 69–71. [15] Kohler JA, Imeson J, Ellershaw C and Lie SO (2000). A randomized trial of 13-Cis retinoic acid in children with advanced neuroblastoma after high-dose therapy. British Journal of Cancer, 83, 1124–1127. [8] Korn EL and Freidlin B (2017). Adaptive clinical trials: advantages and disadvantages of various adaptive design elements. 
Journal of the National Cancer Institute, 109, 1–6. [20] Krishna R, Anderson MS, Bergman AJ, Jin B, Fallon M, Cote J, Rosko K, Chavez-Eng C, Lutz R, Bloomfield DM, Gutierrez M, Doherty J, Bieberdorf F, Chodakewitz J, Gottesdiener KM and Wagner JA (2007). Effect of the cholesteryl ester transfer protein inhibitor, anacetrapib, on lipoproteins in patients with dyslipidaemia and on 24-h ambulatory blood pressure in healthy individuals: two double-blind, randomized placebo-controlled phase I studies. Lancet, 370, 1907–1914. [1, 11, 13] Lafeber M, Grobbee DE, Schrover IM, Thom S, Webster R, Rodgers A, Visseren FLJ, Bots ML and Spiering W (2015). Comparison of a morning polypill, evening polypill and individual pills on LDL-cholesterol, ambulatory blood pressure and adherence in high-risk patients; a randomized crossover trial. International Journal of Cardiology, 181, 193–199. [13] Lake JC, Victor G, Clare G, Porfírio GJM, Kernohan A and Evans JR (2019). Toric intraocular lens versus limbal relaxing incisions for corneal astigmatism after phacoemulsification. Cochrane Database of Systematic Reviews, 12, CD012801. [20] Lan KKG and DeMets DL (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70, 659–663. [20] Lang TA and Altman DG (2013). Basic statistical reporting for articles published in clinical medical journals: the SAMPL Guidelines. In Smart P, Maisonneuve H and Polderman A (eds). Science Editors Handbook. European Association of Science Editors. [11] Larsson P-G and Carlsson B (2002). Does pre- and postoperative metronidazole treatment lower vaginal cuff infection rate after abdominal hysterectomy among women with bacterial vaginosis? Infectious Diseases in Obstetrics and Gynecology, 10, 133–140. [11, 20]


Lau WY, Lai ECH, Leung TWT and Yu SCH (2008). Adjuvant intra-arterial iodine-131-labelled lipiodol for resectable hepatocellular carcinoma: a prospective randomized trial – update on 5-year and 10-year survival. Annals of Surgery, 247, 43–48. [10] Lau WY, Leung TWT, Ho SKW, Chan M, Machin D, Lau J, Chan ATC, Yeo W, Mok TSK, Yu SCH, Leung NWY and Johnson PJ (1999). Adjuvant intra-arterial iodine-131-labelled lipiodol for resectable hepatocellular carcinoma: a prospective randomised trial. Lancet, 353, 797–801. [10, 11] Lenth RV (2018). Java Applets for Power and Sample Size. http://www.stat.uiowa.edu/~rlenth/Power. [9] Leong SS, Toh CK, Lim WT, Lin X, Tan SB, Poon D, Tay MH, Foo KF, Ho J and Tan EH (2007). A randomized phase II trial of single-agent Gemcitabine, Vinorelbine, or Docetaxel in patients with advanced non-small cell lung cancer who have poor performance status and/or are elderly. Journal of Thoracic Oncology, 2, 230–236. [12] Levie K, Gjorup I, Skinhøj P and Stoffel M (2002). A 2-dose regimen of recombinant hepatitis B vaccine with the immune stimulant AS04 compared with the standard 3-dose regimen of Engerix™-B in healthy young adults. Scandinavian Journal of Infectious Disease, 34, 610–614. [1, 11] Levy HL, Milanowski A, Chakrapani A, Cleary M, Lee P, Trefz FK, Whitley CB, Feillet F, Feigenbaum AS, Bebchuk JD, Christ-Schmidt H and Dorenbaum A (2007). Efficacy of sapropterin dihydrochloride (tetrahydrobiopterin, 6R-BH4) for reduction of phenylalanine concentration in patients with phenylketonuria: a phase III randomised placebo-controlled study. Lancet, 370, 504–510. [14] Lignou S (2018). Informed consent in cluster randomised trials: new and common ethical issues. Journal of Medical Ethics, 44, 114–120. [16] Lo ECM, Luo Y, Fan MW and Wei SHY (2001). Clinical investigation of two glass-ionomer restoratives used with the atraumatic restorative treatment approach in China: Two-years results. Caries Research, 35, 458–463. 
[13] Lo ECM, Luo Y, Tan HP, Dyson JE and Corbet EF (2006). ART and conventional root restorations in elders after 12 months. Journal of Dental Research, 85, 929–932. [5, 11] Lobo DN, Bostock KA, Neal KR, Perkins AC, Rowlands BJ and Allison SP (2002). Effect of salt and water balance on recovery of gastrointestinal function after elective colonic resection: a randomised controlled trial. Lancet, 359, 1812–1818. [1, 9] Lu Y and Bean JA (1995). On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine, 14, 1831–1839. [15] Lu Q, Lee S-T, Young SE-L, Tan S-B and Machin D (2020). (Letter to the editor) Prospective clinical trial comparing outcome measures between Furlow and von Langenbeck palatoplasties for UCLP. Annals of Plastic Surgery, 85, 94–95. [12] Machin D, Campbell MJ, Tan SB and Tan SH (2018). Sample Size Tables for Clinical, Laboratory and Epidemiology Studies. (4th edn). Wiley, Chichester. [1, 3, 9, 11, 12] Machin D, Day S and Green S (eds) (2006). Textbook of Clinical Trials. (2nd edn). Wiley, Chichester. [1] Machin D and Lee ST (2000). The ethics of randomization trials in the context of cleft palate research. Plastic and Reconstructive Surgery, 105, 1566–1568. [20] Machin D, Stenning SP, Parmar MKB, Fayers PM, Girling DJ, Stephens RJ, Stewart LA and Whaley JB (1997). Thirty years of Medical Research Council randomized trials in solid tumours. Clinical Oncology, 9, 20–28. [2]


MacIntyre CR, Seale H, Dung TC, Hien NT, Nga PT, Chughtai AA, Rahman B, Dwyer DE and Wang Q (2015). A cluster randomised trial of cloth masks compared with medical masks in healthcare workers. BMJ Open, 5, e006577. [16] Maitland K, Kiguli S, Opoka RO, Engoro C, Olupot-Olupot P, Akech SO, Nyeko R, Mtove G, Reyburn H, Lang T, Brent B, Evans JA, Tibenderana JK, Crawley J, Russell EC, Levin M, Babiker AD and Gibb DM (2011). Mortality and fluid bolus in African children with severe infection. New England Journal of Medicine, 364, 2483–2495. [10] Martin J, Taljaard M, Girling A and Hemming K (2016). Systematic review finds major deficiencies in sample size methodology and reporting for stepped-wedge cluster randomised trials. BMJ Open, 6, e010166. [17] Maruish ME, Maruish M, Kosinski M, Bjorner JB, Gandek B, Turner-Bowker DM and Ware JE (2011). User’s Manual for the SF-36v2 Health Survey. (3rd edn). QualityMetric Inc, Lincoln, RI. [4] McGuire MK and Scheyer ET (2014). Randomized, controlled clinical trial to evaluate a xenogeneic collagen matrix as an alternative to free gingival grafting for oral soft tissue augmentation. Journal of Periodontology, 85, 1333–1341. [15] McMurray JJV, Östergren J, Swedberg K, Granger CB, Held P, Michelson EL, Olofsson B, Yusuf S and Pfeffer MA (2003). Effects of candesartan in patients with chronic heart failure and reduced left-ventricular systolic function taking angiotensin-converting-enzyme inhibitors: the CHARM-Added trial. Lancet, 362, 767–771. [20] Medical Research Council (2002). Cluster Randomised Trials: Methodological and Ethical Considerations. MRC Clinical Trials Series. Medical Research Council, London. [16] Medical Research Council Lung Cancer Working Party (1996). Randomized trial of palliative two-fraction versus more intensive 13-fraction radiotherapy for patients with inoperable non-small cell lung cancer and good performance status. Clinical Oncology, 8, 167–175. 
[2] Medical Research Council Whooping-Cough Immunization Committee (1951). The prevention of whooping-cough by vaccination. British Medical Journal, 1, 1463–1471. [1] Meggitt SJ, Gray JC and Reynolds NJ (2006). Azathioprine dosed by thiopurine methyltransferase activity for moderate-to-severe atopic eczema: a double-blind, randomised controlled trial. Lancet, 367, 839–846. [1, 2, 4, 8, 9, 11, 14] Meyer G, Warnke A, Mühlhauser I and Bender R (2003). Effect on hip fractures of increased use of hip protectors in nursing homes: cluster randomised controlled trial. BMJ, 326, 76–78. [1, 2, 16] Mhurchu CN, Gorton D, Turley M, Jiang Y, Michie J, Maddison R and Hattie J (2013). Effects of a free school breakfast programme on children’s attendance, academic achievement and short-term hunger: results from a stepped-wedge, cluster randomised controlled trial. Journal of Epidemiology & Community Health, 67, 257–264. [17] Miles D, Cameron D, Bondarenko I, Manzyuk L, Alcedo JC, Lopez RI, Im S-A, Canon J-L, Shparyk Y, Yardley DA, Masuda N, Ro J, Denduluri N, Hubeaux S, Quah C, Bais C and O’Shaughnessy J (2017). Bevacizumab plus paclitaxel versus placebo plus paclitaxel as first-line therapy for HER2-negative metastatic breast cancer (MERiDiAN): a double-blind placebo-controlled randomised phase III trial with prospective biomarker evaluation. European Journal of Cancer, 70, 146–155. [18] Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M and Altman DG (2010). CONSORT 2010 Explanation and Elaboration: updated guidelines


for reporting parallel group randomised trials. BMJ, 340, c869 (correction BMJ, 343: d6131). [1, 2, 5, 11] Moher D, Schulz KF and Altman DG (2001). The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet, 357, 1191–1194. [11] Montoya CJ, Higuita EA, Estrada S, Gutierrez FJ, Amariles P, Giraldo NA, Jimenez MM, Velasquez CP, Leon AL, Rugeles MT and Jaimes FA (2012). Randomized clinical trial of lovastatin in HIV-infected, HAART naïve patients (NCT00721305). Journal of Infection, 65, 549–558. [14] Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, Moore L, O’Cathain A, Tinati T, Wight D and Baird J (2015). Process evaluation of complex interventions: Medical Research Council guidance. BMJ, 350, h1258. [17] Motzer RJ, Escudier B, Oudard S, Hutson TE, Porta C, Bracarda S, Grünwald V, Thompson JA, Figlin RA, Hollaender N, Urbanowitz G, Berg WJ, Kay A, Lebwohl D and Ravaud A (2008). Efficacy of everolimus in advanced renal cell carcinoma: a double-blind, randomised, placebo-controlled phase III trial. Lancet, 372, 449–456. [11] Muers MF, Stephens RJ, Fisher P, Darlison L, Higgs CMB, Lowry L, Nicholson AG, O’Brien M, Peake M, Rudd R, Snee M, Steele J, Girling DJ, Mankivell M, Pugh C and Parmar MKB (2008). Active symptom control with or without chemotherapy in the treatment of patients with malignant pleural mesothelioma (MS01): a multicentre trial. Lancet, 371, 1685–1694. [12] National Council for Social Studies (2019). Power Analysis and Sample Size Software (PASS): Version 2019. National Council for Social Studies, Kaysville, UT. [9] Newcombe RG and Altman DG (2000). Proportions and their differences. In Altman DG, Machin D, Bryant TN and Gardner MJ (eds). Statistics with Confidence. (2nd edn). BMJ Books, London, 45–56. [8] Nixon J, McElvenny D, Mason S, Brown J and Bond S (1998). 
A sequential randomised controlled trial comparing a dry visco-elastic polymer gel pad and standard operating table mattresses in the prevention of post-operative pressure sores. International Journal of Nursing Studies, 35, 193–203. [20] Nordin P, Thielecke M, Ngomi N, Mudanga GM, Krantz I and Feldmeier H (2017). Treatment of tungiasis with a two-component dimeticone: a comparison between moistening the whole foot and directly targeting the embedded sand fleas. Tropical Medicine and Health, 45, 6. [19] O’Brien PC and Fleming TR (1979). A multiple testing procedure for clinical trials. Biometrics, 35, 549–556. [10, 20] O’Cathain A, Croot L, Duncan E, Rousseau N, Sworn K, Turner KM, Yardley L and Hoddinott P (2019). Guidance on how to develop complex interventions to improve health and healthcare. BMJ Open, 9, e029954. [2, 17] Oliver PC, Crawford MJ, Rao B, Reece B and Tyrer P (2007). Modified overt aggression scale (MOAS) for people with intellectual disability and aggressive challenging behaviour: a reliability study. Journal of Applied Research in Intellectual Disabilities, 20, 368–372. [11] Palumbo A, Bringhen S, Caravita T, Merla E, Capparella V, Callea V, Cangialosi C, Grasso M, Rossini F, Galli M, Catalano L, Zamagni E, Petrucci MT, De Stefano V, Ceccarelli M, Ambrosini MT, Avonto I, Falco P, Ciccone G, Liberati AM, Musto P and Boccadoro M (2006). Oral melphalan and prednisone chemotherapy plus thalidomide compared with melphalan and prednisone alone in elderly patients with multiple myeloma: randomised controlled trial. Lancet, 367, 825–831. [4, 9]


Papp K, Bissonnette R, Rosoph L, Wasel N, Lynde CW, Searles G, Shear NH, Huizinga RB and Maksymowych WP (2008). Efficacy of ISA247 in plaque psoriasis: a randomised multicentre, double-blind, placebo-controlled phase III study. Lancet, 371, 1337–1342. [12] Parmar MKB, Barthel FM-S, Sydes M, Langley R, Kaplan R, Eisenhauer E, Brady M, James N, Bookman MA, Swart A-M, Qian W and Royston P (2008). Speeding up the evaluation of new agents in cancer. Journal of the National Cancer Institute, 100, 1204–1214. [4, 20] Peters TJ, Richards SH, Bankhead CR, Ades AE and Sterne JAC (2003). Comparison of methods for analysing cluster randomized trials: an example involving a factorial design. International Journal of Epidemiology, 32, 840–846. [16] Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J and Smith PG (1976). Design and analysis of randomized clinical trials requiring prolonged observation of each patient (I) Introduction and design. British Journal of Cancer, 34, 585–612. [1, 10] Peto R, Pike M, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J and Smith PG (1977). Design and analysis of randomized clinical trials requiring prolonged observation of each patient (II) Analysis and examples. British Journal of Cancer, 35, 1–39. [1] Phillips A (2006). Flexibility by design. GCPj. Informa UK Ltd, 13–17, July. http://www.GCPj.com. [20] Piaggio G, Carroli G, Villar J, Pinol A, Bakketeig L, Lumbiganon P, Bergsjø P, Al-Mazrou Y, Ba’aqeel H, Belizán JM, Farnot U and Berendes H (2001). Methodological considerations on the design and analysis of an equivalence stratified cluster randomization trial. Statistics in Medicine, 20, 401–416. [15, 20] Piaggio G, Elbourne DR, Pocock SJ, Evans SJW and Altman DG (2012). Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement. Journal of the American Medical Association, 308, 2594–2604. [11, 15] Piaggio G and Pinol AP (2001). 
Use of the equivalence approach in reproductive health clinical trials. Statistics in Medicine, 20, 3571–3577. [15] Pintilie M (2006). Competing risks: A practical perspective. Wiley, Chichester. Pocock SJ (1983). Clinical Trials: A Practical Approach. Wiley, Chichester. [1, 10, 20] Pocock SJ and Ware JE (2009). Translating statistical findings into plain English. Lancet, 373, 1926–1928. [8, 11] Pocock SJ and White I (1999). Trials stopped early: too good to be true? Lancet, 353, 943–944. [10] Polack FP, Thomas SJ, Kitchin N, Absalon J, Gurtman A, Lockhart S, Perez JL, Marc GP, Moreira ED, Zerbini C, Bailey R, Swanson KA, Roychoudhury S, Koury K, Li P, Kalina WV, Cooper D, Frenck RW, Hammitt LL, Türeci Ö, Nell H, Schaefer A, Ünal S, Trena DB, Mather S, Dormitzer MD, Şahin U, Jansen KU and Gruber WC (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New England Journal of Medicine, 383, 2603–2615. [Preface] Poldervaart JM, Reitsma JB, Backus BE, Koffijberg H, Veldkamp RF, ten Haaf ME, Appelman Y, Mannaerts HFJ, van Dantzig J-M, van den Heuvel M, el Farissi M, Rensing BJWM, Ernst NMSAKJ, Dekker IMC, den Hartog FR, Oosterhof T, Lagerweij GR, Buijs EM, van Hessen MWJ, Landman MAJ, van Kimmenade RRJ, Cozijbsen L, Bucx JJJ, van Ofwegen-Hanekamp CEE, Cramer M-J, Six AJ, Doevendans PA and Hoes AW (2017). Effect of using the HEART score in patients with chest pain in the emergency department. Annals of Internal Medicine, 166, 689–697. [17] Poldervaart JM, Reitsma JB, Koffijberg H, Backus BE, Six AJ, Doevendans PA and Hoes AW (2013). The impact of the HEART risk score in the early assessment of patients with acute

chest pain: design of a stepped wedge, cluster randomised trial. BMC Cardiovascular Disorders, 13, 77. [17]
Pool SMW, Struys MMRF and van der Lei B (2015). A randomised double-blind crossover study comparing pain during anaesthetising the eyelids in upper blepharoplasty: first versus second eyelid and lidocaine versus prilocaine. Journal of Plastic, Reconstructive & Aesthetic Surgery, 68, 1242–1247. [13]
Poon CY, Goh BT, Kim M-J, Rajaseharan A, Ahmed S, Thongprasom K, Chaimusik M, Suresh S, Machin D, Wong HB and Seldrup J (2006). A randomised controlled trial to compare steroid with cyclosporine for the topical treatment of oral lichen planus. Oral Surgery Oral Medicine Oral Pathology Oral Radiology and Endodontics, 102, 47–55. [2, 3, 5, 7, 8, 11, 13]
Porter S, McConnell T, Graham-Wisener L, Regan J, McKeown M, Kirkwood J, Clarke M, Gardner E, Dorman S, McGrillen K and Reid J (2018). A randomized controlled pilot and feasibility study of music therapy for improving the quality of life of hospice inpatients. BMC Palliative Care, 17, 125. [19]
Pozzi A, Agliardi E, Tallarico M and Barlattani A (2012). Clinical and radiological outcomes of two implants with different prosthetic interfaces and neck configurations: randomized, controlled, split-mouth clinical trial. Clinical Implant Dentistry and Related Research, 16, 96–106. [1, 13]
Ravaud A, Hawkins R, Gardner JP, von der Maase H, Zantl N, Harper P, Rolland F, Audhuy B, Machiels JP, Pétavy F, Gore M, Schöffski P and El-Hariry I (2008). Lapatinib versus hormone therapy in patients with advanced renal cell carcinoma: a randomized phase III clinical trial. Journal of Clinical Oncology, 26, 2285–2291. [18]
Redmond C and Colton T (eds) (2001). Biostatistics in Clinical Trials. Wiley, Chichester. [1]
Reynolds NJ, Franklin V, Gray JC, Diffey BL and Farr PM (2001). Narrow-band ultraviolet B and broad-band ultraviolet A phototherapy in adult atopic eczema: a randomised controlled trial. Lancet, 357, 2012–2016. [14]
Rutterford C, Copas A and Eldridge S (2015). Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44, 1051–1067. [16]
SAS Institute (2017). Getting Started with the SAS Power and Sample Size Application: Version 9.4. SAS Institute, Cary, NC. [9]
Schoenfeld D (1982). Partial residuals for the proportional hazards regression model. Biometrika, 69, 239–241. [11]
Schottenfeld RS, Chawarski MC and Mazlan M (2008). Maintenance treatment with buprenorphine and naltrexone for heroin dependence in Malaysia: a randomised, double-blind, placebo-controlled trial. Lancet, 371, 2192–2200. [10, 12]
Schroeder H, Werner M, Meyer D-R, Reimer P, Kruger K, Jaff MR and Brodmann M (2017). Low-dose paclitaxel-coated versus uncoated percutaneous transluminal balloon angioplasty for femoropopliteal peripheral artery disease. Circulation, 135, 2227–2236. [15]
Schulz KF, Altman DG and Moher D for the CONSORT Group (2010). CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ, 340, c332. [1]
Scott NW, McPherson GC, Ramsay CR and Campbell MK (2002). The method of minimization for allocation to clinical trials. Controlled Clinical Trials, 23, 662–674. [5]
Senn SJ (2002). Cross-over Trials in Clinical Research. (2nd edn). Wiley, Chichester. [13]
Shamseer L, Moher D, Clarke M, Ghersi D, Liberati A, Petticrew M, Shekelle P and Stewart LA (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ, 349, g7647. [20]
Sim J and Lewis M (2011). The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. Journal of Clinical Epidemiology, 65, 301–308. [19]
Simon R (2005). A roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology, 23, 7332–7341. [18]
Simon R (2008). The use of genomics in clinical trial design. Clinical Cancer Research, 14, 5984–5993. [18]
Simon R (2012). Clinical trials for predictive medicine. Statistics in Medicine, 31, 3031–3040. [18]
Simon R, Wittes RE and Ellenberg SS (1985). Randomized phase II clinical trials. Cancer Treatment Reports, 69, 1375–1381. [12]
Singapore Lichen Planus Study Group (2004). A randomized controlled trial to compare calcipotriol with betamethasone valerate for the treatment of cutaneous lichen planus. Journal of Dermatological Treatment, 15, 141–145. [11]
Smith I, Procter M, Gelber RD, Guillaume S, Feyereislova A, Dowsett M, Goldhirsch A, Untch M, Mariani G, Baselga J, Kaufmann M, Cameron D, Bell R, Bergh J, Coleman R, Wardley A, Harbeck N, Lopez RI, Mallmann P, Gelmon K, Wilcken N, Wist E, Rovira PS and Piccart-Gebhart MJ (2007). 2-year follow-up of trastuzumab after adjuvant chemotherapy in HER2-positive breast cancer: a randomised controlled trial. Lancet, 369, 29–36. [1, 5, 11]
Smolen JS, Beaulieu A, Rubbert-Roth A, Ramos-Remus C, Rovensky J, Alecock E, Woodworth T and Alten R (2008). Effect of interleukin-6 receptor inhibition with tocilizumab in patients with rheumatoid arthritis (Option study): a double-blind, placebo-controlled, randomised trial. Lancet, 371, 987–997. [11, 12]
Sosnowski K, Mitchell ML, White H, Morrison L, Sutton J, Sharratt J and Lin F (2018). A feasibility study of a randomised controlled trial to examine the impact of ABCDE bundle on quality of life in ICU survivors. Pilot and Feasibility Studies, 4, 32. [19]
Spiegelhalter DJ, Freedman LS and Parmar MKB (1994). Bayesian approaches to randomized trials (with discussion). Journal of the Royal Statistical Society A, 157, 357–416. [20]
Sprangers MA, Cull A and Groenvold M (1998). EORTC Quality of Life Study Group: Guidelines for Developing Questionnaire Modules. EORTC, Brussels. [4]
StataCorp (2019). Stata Statistical Software Release 16. StataCorp LLC, College Station, TX. [8, 9, 11, 13]
Statistical Solutions (2019). nQuery Adviser: Version 7.0. Statistical Solutions Ltd., Saugus, MA. [9]
Stevinson C, Devaraj VS, Fountain-Barber A, Hawkins S and Ernst E (2003). Homeopathic arnica for prevention of pain and bruising: randomized placebo-controlled trial in hand surgery. Journal of the Royal Society of Medicine, 96, 60–65. [1, 2, 5, 9, 11, 12]
Stone RM, Mandrekar SJ, Sanford BL, Laumann K, Geyer S, Bloomfield CD, Thiede C, Prior TW, Döhner K, Marcucci G, Lo-Coco F, Klisovic RB, Wei A, Sierra J, Sanz MA, Brandwein JM, de Witte T, Niederwieser D, Appelbaum FR, Medeiros BC, Tallman MS, Krauter J, Schlenk RF, Ganser A, Serve H, Ehninger G, Amadori S, Larson RA and Döhner H (2017). Midostaurin plus chemotherapy for acute myeloid leukemia with FLT3 mutation. The New England Journal of Medicine, 377, 454–464. [18]
Sugg HVR, Richards DA and Frost J (2018). Morita Therapy for depression (Morita Trial): a pilot randomised controlled trial. BMJ Open, 8, e021605. [19]
Syn NL, Wong AL-A, Lee S-C, Teoh H-L, Yip JWL, Seet RCS, Yeo WT, Kristanto W, Bee P-C, Poon LM, Marban P, Wu TS, Winther MD, Brunham LR, Soong R, Tai B-C and Goh B-C
(2018). Genotype-guided versus traditional clinical dosing of warfarin in patients of Asian ancestry: a randomized controlled trial. BMC Medicine, 16, 104–113. [15]
Szefler SJ, Mitchell H, Sorkness CA, Gergen PJ, O’Connor GT, Morgan WJ, Kattan M, Pongracic JA, Teach SJ, Bloomberg GR, Eggleston PA, Gruchalla RS, Kercsmar CM, Liu AH, Wildfire JJ, Curry MD and Busse WW (2008). Management of asthma based on exhaled nitric oxide in addition to guideline-based treatment for inner-city adolescents and young adults: a randomised controlled trial. Lancet, 372, 1065–1072. [11]
Tai BC, Chen ZJ and Machin D (2018). Estimating sample size in the presence of competing risks – Cause-specific hazard or cumulative incidence approach? Statistical Methods in Medical Research, 27, 114–125. [9]
Tai B-C and Machin D (2014). Regression Methods for Medical Research. Wiley, Chichester. [8]
Tai BC, Wee JTS and Machin D (2011). Analysis and design of randomised clinical trials involving competing risks endpoints. Trials, 12, 127. [9]
Tan SB, Chung YFA, Tai BC, Cheung YB and Machin D (2003). Elicitation of prior distributions for a phase III randomized controlled trial of adjuvant therapy with surgery for hepatocellular carcinoma. Controlled Clinical Trials, 24, 110–121. [20]
Tan SB, Dear KBG, Bruzzi P and Machin D (2003). Strategy for randomised clinical trials in rare cancers. BMJ, 327, 47–49. [20]
Tan EH, Wong ALA, Tan CC, Wong P, Tan SH, Ang YLE, Lim SE, Chong WQ, Ho JS, Lee SC and Tai BC (2020). Improving medication adherence with adjuvant aromatase inhibitor in women with breast cancer: a randomised controlled trial to evaluate the effect of short message service (SMS) reminder. The Breast, 53, 77–84. [4]
Tan SB, Wee J, Wong H-B and Machin D (2008). Can external and subjective information ever be used to reduce the size of a randomised controlled trial? Contemporary Clinical Trials, 29, 211–219. [20]
Tang C-L, Eu K-W, Tai B-C, Soh JGS, Machin D and Seow-Choen F (2001). Randomized clinical trial of the effect of open versus laparoscopically assisted colectomy on systemic immunity in patients with colorectal cancer. British Journal of Surgery, 88, 801–807. [4]
Teare MD, Dimairo M, Shephard N, Hayman A, Whitehead A and Walters SJ (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials, 15, 264. [19]
The ATAC (Arimidex, Tamoxifen Alone or in Combination) Trialists’ Group (2002). Anastrozole alone or in combination with tamoxifen versus tamoxifen alone for adjuvant treatment of postmenopausal women with early breast cancer: first results of the ATAC randomised trial. Lancet, 359, 2131–2139. [12]
The RECOVERY Collaborative Group (2020). Dexamethasone in hospitalized patients with Covid-19 - Preliminary report. New England Journal of Medicine. doi:10.1056/NEJMoa2021436. [Preface]
Thielecke M, Nordin P, Ngomi N and Feldmeier H (2014). Treatment of tungiasis with dimeticone: a proof-of-principle study in Rural Kenya. PLoS Neglected Tropical Diseases, 8, e3058. [19]
Todd J, Heyderman RS, Musoke P and Peto T (2013). When enough is enough: how the decision was made to stop the FEAST trial: data and safety monitoring in an African trial of fluid expansion as supportive therapy (FEAST) for critically ill children. Trials, 14, 85. [10]
Tyrer P, Oliver-Africano PC, Ahmed Z, Bouras N, Cooray S, Deb S, Murphy D, Hare M, Meade M, Reece B, Kramo K, Bhaumik S, Harley D, Regan A, Thoma D, Rao B, North B, Eliahoo J, Karatela S, Soni A and Crawford M (2008). Risperidone, haloperidol, and placebo in the
treatment of aggressive challenging behaviour in patients with intellectual disability: a randomised controlled trial. Lancet, 371, 57–63. [11, 12]
Underwood M, Lamb SE, Eldridge S, Sheehan B, Slowther A-M, Spencer A, Thorogood M, Atherton N, Bremner SA, Devine A, Diaz-Ordaz K, Ellard DR, Potter R, Spanjers K and Taylor SJC (2013). Exercise for depression in elderly residents of care homes: a cluster-randomised controlled trial. Lancet, 382, 41–49. [16]
van der Meché FGA and Schmitz PIM (1992). A randomized trial comparing intravenous immune globulin and plasma exchange in Guillain-Barré syndrome. New England Journal of Medicine, 326, 1123–1129. [11]
van Herwaarden N, van der Maas A, Minten MJM, van den Hoogen FHJ, Kievit W, van Vollenhoven RF, Bijlsma JWJ, van den Bemt BJF and den Broeder AA (2015). Disease activity guided dose reduction and withdrawal of adalimumab or etanercept compared with usual care in rheumatoid arthritis: open label, randomised controlled, non-inferiority trial. BMJ, 350, h1389. [15]
Verstraete E, Veldink JH, Huisman MHB, Draak T, Uijendaal EV, van der Kooi AJ, Schelhaas HJ, de Visser M, van der Tweel I and van den Berg LH (2012). Lithium lacks effect on survival in amyotrophic lateral sclerosis: a phase IIb randomised sequential trial. Journal of Neurology, Neurosurgery and Psychiatry, 83, 557–564. [20]
Vigneault E, Morton G, Parulekar WR, Niazi TM, Springer CW, Barkati M, Chung P, Koll W, Kamran A, Monreal M, Ding K and Loblaw A (2018). Randomised Phase II feasibility trial of image-guided external beam radiotherapy with or without high dose rate brachytherapy boost in men with intermediate-risk prostate cancer (CCTG PR15/NCT01982786). Clinical Oncology, 30, 527–533. [19]
Wager E and Elia N (2014). Why should clinical trials be registered? European Journal of Anaesthesiology, 32, 397–400. [6]
Walker E and Nowacki AS (2011). Understanding equivalence and noninferiority testing. Journal of General Internal Medicine, 26, 192–196. [15]
Walter SD, Han H, Guyatt GH, Bassler D, Bhatnagar N, Gloy V, Schandelmaier S and Briel M (2020). A systematic survey of randomised trials that stopped early for reasons of futility. BMC Medical Research Methodology, 20, 10. [10]
Walters SJ, dos Anjos Henriques-Cadby IB, Bortolami O, Flight L, Hind D, Jacques RM, Knox C, Nadin B, Rothwell J, Surtees M and Julious SA (2017). Recruitment and retention of participants in randomised controlled trials: a review of trials funded and published by the United Kingdom Health Technology Assessment Programme. BMJ Open, 7, e015276. [9]
Wang D and Bakhai A (eds) (2006). Clinical Trials. A Practical Guide to Design, Analysis and Reporting. Remedica, London. [1]
Ware JE, Snow KK, Kosinski M and Gandek B (1993). SF-36 Health Survey Manual and Interpretation Guide. New England Medical Centre, Boston, MA. [4]
Wee JTS, Tan EH, Tai BC, Wong H-B, Leong SS, Tan T, Chua E-T, Yang E, Lee KM, Fong KW, Khoo HS, Lee KS, Loong S, Sethi V, Chua EJ and Machin D (2005). Randomized trial of radiotherapy versus concurrent chemo-radiotherapy followed by adjuvant chemotherapy in patients with AJCC/UICC (1997) stage III and IV nasopharyngeal cancer of the endemic variety. Journal of Clinical Oncology, 23, 6730–6738. [2, 4, 9, 11, 20]
Weng J, Li Y, Xu W, Shi L, Zhang Q, Zhu D, Hu Y, Zhou Z, Yan X, Tian H, Ran X, Luo Z, Xian J, Yan L, Li F, Zeng L, Chen Y, Yang L, Yan S, Liu J, Li M, Fu Z and Cheng H (2008). Effect of intensive insulin therapy on β-cell function and glycaemic control in patients with newly
diagnosed type 2 diabetes: a multicentre randomised parallel-group trial. Lancet, 371, 1753–1760. [1, 12]
Whitehead AL, Julious SA, Cooper CL and Campbell MJ (2015). Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Statistical Methods in Medical Research, 25, 1057–1073. [19]
WHO (2006). WHO Child Growth Standards. World Health Organization, Geneva. [4]
WHO (2020). WHO International Clinical Trials Registry Platform. SCI/RFH/EPS/ICTRP. [6]
Williams HC, Burney PGJ, Hay RJ, Archer CB, Shipley MJ, Hunter JJA, Bingham EA, Finlay AY, Pembroke AC, Graham-Brown RAC, Atherton DA, Lewis-Jones MS, Jolden CA, Harper JI, Champion RH, Poyner TF, Launer J and David TJ (1994). The U.K. Working Party’s Diagnostic Criteria for Atopic Dermatitis: I. Derivation of a minimum set of discriminators for atopic dermatitis. British Journal of Dermatology, 131, 383–396. [2]
Williams WN, Seagle MB, Pegoraro-Krook MI, Souza TV, Garla L, Silva ML, Macado Neto JS, Dutka JCR, Nackashi J, Boggs S, Shuster J, Moorhead J, Wharton W, Graciano MIG, Pimentel MC, Feniman M, Piazentin-Penna SHA, Kemker J, Zimmerman MC, Bento-Gonçalvez C, Borgo H, Marques IL, Martinelli APMC, Jorge JC, Antonelli P, Neves JFA and Whitaker ME (2011). Prospective clinical trial comparing outcome measures between Furlow and von Langenbeck palatoplasties for UCLP. Annals of Plastic Surgery, 66, 154–163. [8, 12]
Wright JR, Bouma S, Dayes I, Simunovic MR, Levine MN and Whelan TJ (2006). The importance of reporting patient recruitment details in Phase III trials. Journal of Clinical Oncology, 24, 843–845. [2]
Yang HK, Ji JF, Han SU, Terashima M, Li GX, Kim HH, Law S, Shabbir A, Song KY, Woo JH, Hyung WJ, Kosai NR, Kono K, Misawa K, Yabusaki H, Kinoshita T, Lau PC, Kim YW, Rao JR, Ng E, Yamand T, Yoshida K, Park DJ, Tai BC and So JBY (2021). Extensive peritoneal lavage with saline after curative gastrectomy for gastric cancer (EXPEL): a multi-centre randomised controlled Phase III trial. Lancet Gastroenterology and Hepatology, 6, 120–127. [20]
Yeatts SD, Martin RH, Coffey CS, Lyden PD, Foster LD, Woolson RF, Broderick JP, Di Tullio MR, Jungreis CA and Palesch YY (2014). Challenges of decision making regarding futility in a randomized trial: the interventional management of stroke III experience. Stroke, 45, 1408–1414. [10]
Yeo TJ, Yeo PSD, Hadi FA, Cushway T, Lee KY, Yin FF, Ching A, Li R, Loh SY, Lim SL, Wong RC-C, Tai BC, Richards AM and Lam CSP (2018). Single-dose intravenous iron in Southeast Asian heart failure patients: a pilot randomized placebo-controlled study (PRACTICE-ASIA-HF). ESC Heart Failure, 5, 344–353. [14]
Yeow VK-L, Lee S-T, Cheng JJ, Chua K, Wong H-B and Machin D (2007). Randomised clinical trials in plastic surgery: survey of output and quality of reporting. Journal of Plastic and Reconstructive Surgery, 60, 965–966. [2, 7, 12]
Yeow VK-L, Young SE-L, Chen PK-T, Lee S-T, Machin D and Lu Q (2019). Surgical and speech outcomes of two surgical techniques and two timings for cleft palate repair: a randomised trial. Australasian Journal of Plastic Surgery, 2, 44–54. [1, 2, 8, 10, 11, 12]
Young T, de Haes H, Curran D, Fayers PM, Brandberg Y, Vanvoorden V and Bottomley A (2002). Guidelines for Assessing Quality of Life in EORTC Clinical Trials. European Organisation for Research and Treatment of Cancer (EORTC), Brussels. [4]
Yu L-M, Chan A-W, Hopewell S, Deeks JJ and Altman DG (2010). Reporting on covariate adjustment in randomised controlled trials before and after revision of the 2001 CONSORT statement: a literature review. Trials, 11, 59. [11]
Zelen M (1979). A new design for randomized clinical trials. New England Journal of Medicine, 300, 1242–1245. [20]
Zelen M (1992). Randomised consent trials. Lancet, 340, 375. [20]
Zheng J-P, Kang J, Huang S-G, Chen P, Yao W-Z, Yang L, Bai C-X, Wang C-Z, Wang C, Chen B-Y, Shi Y, Liu C-T, Chen P, Li Q, Wang Z-S, Huang Y-J, Luo Z-Y, Chen F-P, Yuan J-Z, Yuan B-T, Qian H-P, Zhi R-C and Zhong N-S (2008). Effect of carbocisteine on acute exacerbation of chronic obstructive pulmonary disease (PEACE Study): a randomised placebo-controlled study. Lancet, 371, 2013–2018. [11]
Zhu H, Zhang S and Ahn C (2017). Sample size considerations for split-mouth design. Statistical Methods in Medical Research, 26, 2543–2551. [13]
Zongo I, Dorsey G, Rouamba N, Tinto H, Dokomajilar C, Guiguemde RT, Rosenthal PJ and Ouedraogo JB (2007). Artemether-lumefantrine versus amodiaquine plus sulfadoxine-pyrimethamine for uncomplicated falciparum malaria in Burkina Faso: a randomised non-inferiority trial. Lancet, 369, 491–498. [1, 5, 8]
Zwarenstein M, Treweek S, Gagnier JJ, Altman DG, Tunis S, Haynes B, Oxman AD and Moher D (2008). Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ, 337, a2390. [11]

Index

Note: Page numbers in italic refer to figures. Page numbers in bold refer to tables.

abstracts, in protocols, 47–8 acute angle-closure glaucoma, 47, 61, 63, 68, 69 acute myeloid leukaemia, 423, 457–9 adaptive approaches, 452–61 sequential, 452–6 adaptive threshold designs, 431–4 adjusting for baselines (design), 174–6, 340 adverse events, 82, 215–7 allocation ratios to detect, 201 code breaking, 118 describing in papers, 255–7 multicentre trials, 133 need to report, 129 protocol amendment, 125 reporting to external agencies, 133–4 age inclusion criterion, 57–8 myeloma and, 82 aggressive challenging behaviour, 239, 245–6, 273–4 AHCC01 trial (1997) protocol, 52 randomisation, 58 allocation of drugs to patients, 129–30 allocation ratios, 36, 52, 108–9, 110, 198, 201 fundamental equation, 190 alpha-spending function, 467–9 alternative hypothesis (HA), 187, 206 1-sided, 206, 419–20 ambulatory blood pressure, 307–8, 310 amendments, protocols, 73–5, 125 amyotrophic lateral sclerosis, 456, 457 anacetrapib blood pressure, 306 dyslipidaemia, 255–7, 306 anaesthesia

children, 221–2 eyelid surgery, 293–4, 295, 296, 298–9, 303, 309–10, 311 analysis, 38–41, 147–8, 181–2, 299 adaptive threshold designs, 432 biomarker-stratified designs, 425–7 cluster trials, 386–8 complex structure comparisons, 290 cross-over trials, 308–10 dose–response designs, 277–9 enrichment designs, 422 equivalence trials, 371 factorial designs, 285–8 matched-pair trials, 295–301 non-inferiority trials, 360–6 repeated measures design, 322–4 reporting on, 247–8 specification in protocols, 64–6 split-mouth designs, 314–5 stepped wedge designs, 399–404 unstructured design trials, 266, 272 see also interim analyses; meta-analyses analysis files, 144 analysis of covariance (ANCOVA), 340–1, 344, 345 analysis of variance (ANOVA), 273, 299, 303 anthropometric measures, 88 antipsychotic drugs, 270 antipyretics, 39 appendices, protocols, 72–3, 74 approval from authorities, 126 archiving, 144 area under the curve, 148–9 arithmetic mean, 149 arnica, 6, 30, 202, 275–9

Randomised Clinical Trials: Design, Practice and Reporting, Second Edition. David Machin, Peter M. Fayers, and Bee Choo Tai. © 2021 John Wiley & Sons Ltd. Published 2021 by John Wiley & Sons Ltd.

assessments, 82–3 describing in papers, 238 specification in protocols, 61–2 asthma, 241, 248 ATAC trial, breast cancer, 266–7, 268–9 atopic eczema, 8, 13–4, 30–1, 150, 152, 156, 157, 161–2, 181, 190–1, 195–6, 250–2, 254, 256, 257, 321 phototherapy, 321, 325, 341, 342 SASSAD score, 13, 14, 36, 150, 152, 153, 154, 156, 157, 169–70, 204, 254, 257, 321 audit trails, 131 authorship, 143, 233–4 autocorrelation, 319, 331–9, 348 exchangeable, 332–3, 338–44, 348 autocorrelation coefficient, 331 autoregressive observations, 333 Avicenna (980–1037), 21 azathioprine, 8, 13–4, 250–2, 254, 257, 321 azithromycin, cystic fibrosis, 295, 299–301, 304–5 background in papers, 235–6 in protocols, 49 baselines, 36–7, 92–3 adjusting for, 174–6, 340 change from (design), 339, 344, 346, 346–7 patient characteristics, 252, 253 repeated measures design, 322, 325–8 stepped wedge designs, 392 Bayesian statistics, 451, 463–7 very small trials, 472 Bayesian synthesis, 464 bed sores assessment scale, 84 mattresses, 452–3, 454–5 as multiple sites, 355–6 time to develop, 86 before-and-after design, 17, 399, 409–10 best intervention, 268–9 β2-microglobulin, 82 beta-binomial distribution, 470 between-subject standard deviations, 302 bexarotene, non-small-cell lung cancer, 433 bias carry-across effects, 313 intention-to-treat, 40 from missing data, 131 from non-randomisation, 17 publication bias, 477 response bias, 100–1 risk of, 479–80 selection bias, 103–4, 114, 388

binary endpoints, 78, 79, 304 cluster trials, 383–5 non-inferiority trials, 362–6, 368–70 from ordered categorical data, 161 pilot studies for, 447 repeated measures design, 346 split-mouth designs, 317 trial size, 191, 194 binary sequences, random, 104–5 biomarker enrichment designs (BrichD), 420–1 biomarkers, 417–34 biomarker-stratified designs, 422–34 blepharoplasty, anaesthesia, 293–4, 295, 296, 298, 303, 309–10, 311 blinding, 16–7, 53, 91 cluster trials, 388 consent and, 103–4 drug packages, 117–8, 237 malfunction in, 259 blocks randomisation, 106–11, 114, 115, 282 random size change, 108 blood pressure, 382–3, 386–8 ambulatory, 307–8, 310 anacetrapib, 306 grading, 91–2 see also hypertension body surface area, atopic eczema, 321, 325 bolus fluid expansion, children, 225 breast cancer ATAC trial, 266–7, 268–9 COMPLIANCE trial, 73, 74, 88–9, 94 HER2-positive, 7, 113 vascular endothelial growth factor, 426 brevity, forms, 72 brimonidine, 47, 61, 63, 68, 69 British Journal of Cancer, 21–2 British Medical Association, Collective Investigation Record, 21 bullet points, 57 buprenorphine, 220, 271–2 burns consent and, 68 endpoints, 83 healing, 79, 143 pain measurement, 85 partial thickness, 32, 143 cancer cachexia, 225 tumour measurement, 83 Zelen randomised consent designs, 474 see also specific cancers

candesartan, 462, 463 carbocisteine, COPD, 245 care homes consent in, 376–7 exercise for depression, 377 hip protectors, 10–1, 29, 35, 379 nurse-led medicines monitoring, 410–1 caries, 313, 314–5, 317 carpal tunnel syndrome, 6, 30, 237 carry-across effects, split-mouth designs, 312, 313 carry-over effects, cross-over trials, 306 case record forms, 130–1 case series, 18 categorical data (nominal data), 78–9, 91–2 cause-specific hazard (CSH), 204–5 censored data, 80, 86, 202 survival times, 163, 165–6 see also missing assessments change from baseline, design effect, 339, 344, 346, 346–7 CHARM-Added trial, 462, 463 checking data capture, 131–2 checklists, for publication process, 231 ChemFlex (dental), Fuji IX Gp vs, 316 chemoradiotherapy, nasopharyngeal cancer, 29, 50, 54, 96–8, 204–5, 217–8, 244, 466 chest pain, 407–8 children bolus fluid expansion, 225 general anaesthesia, 221–2 chi-squared test, 267 Christmas tree boundaries, double triangular sequential design, 455 chronic obstructive pulmonary disease (COPD) carbocisteine, 245 drugs compared, 284, 285–6 cisplatin, mesothelioma, 287 cleft palate, 48, 61, 160, 218, 281, 283, 474 unilateral, 288 clinical opinion, as elicited prior distribution, 466, 467 clinical practice, influence of trial results, 20 clinical trial, defined, 4 closed cohort stepped wedge designs, 391, 392, 394, 408–13 closed questions questionnaires, 99–100 trial forms, 95 clusters, outcome measures, 80 cluster trials, 10–1, 375–89 stepped wedge designs, 391–413 varying cluster size, 385–6 Cochrane Collaboration, systematic reviews, 22, 476

code breaking, double-blind RCT, 118, 128–9 Cohen criteria, 199–200 colectomy, trial form, 96 collective authorship, 234 Collective Investigation Record, BMA, 21 combining trial results, 478 communications networks, 128 competing risks (CR), 204–5 complex interventions, 414 complex structure comparisons, 289–91 COMPLIANCE trial, breast cancer, 73, 74, 88–9, 94 composite endpoints, 85 compound symmetry (exchangeable autocorrelation), 332–3, 338–44, 348 computer-based randomisation, 118, 241–2 conclusions, reporting, 259–60 conditional power test, 226–8, 469 confidence intervals (CI), 149–50, 152, 159, 183 forest plots, 478 for median difference, 155 non-inferiority trials, 359–60 repeated measures design, 351 confidence levels (η), 469 confidential memos, on randomisation, 58 confirmatory trials, 24, 25, 29 conflicts of interest, 234–5 consent, 15, 68–9 cluster trials, 376–7 delay after, 37 forms electronic, 69 protocol appendices, 73, 74 prior knowledge influencing, 103–4 process, 32–3, 238–40 withdrawals of, 33, 37–8 see also Zelen randomised consent designs consistency checks, 140 Consolidation of the Standards of Reporting Trials (CONSORT) statement, 231–2, 260–1 cluster trials, 388, 389 pilot studies, 449 trial profiles, 248–50 constraints, 18–20 contamination, cluster trials, 376 continuous endpoints cluster trials, 382 multiple interventions, 270–1 non-inferiority trials, 361–2, 366–8 pilot studies for, 446 repeated measures design, 344–6 continuous numerical data, 79, 91–2 dichotomisation, 431–2 stratification, 111

continuous subcutaneous insulin infusion, 6, 289–90, 291 contrast (C), repeated measures design, 325, 342–3 controls, 34 convergence problems, autocorrelation, 348 corneal astigmatism, 479–80 coronary arteries, magnesium stents, 11, 29, 30, 201 correlation coefficients, 331 within-subject, 313, 355 correspondence, 144 costs, 90 covariance analysis of (ANCOVA), 340–1, 344, 345 contrast (C), 342–3 COVID-19 pandemic, 69 Cox, D.R., 22 Cox proportional hazards regression model, 173–4, 175–6, 179 cranberry juice, prostate cancer radiotherapy, 111–2 Crohn’s disease, 252 cross-checks, 132 cross-over trials, 293, 305–11, 367–8 examples, 8–9, 365–6, 367–8, 372 guidelines, 317 two-period – two-treatment, 305, 307, 308–10, 364–6 cross-sectional (open cohort) stepped wedge designs, 391, 392, 394, 398–409 cumulative incidence method (CMI), 205 cumulative normal distributions, standard, 148 cumulative probability function, 468 cut-points, adaptive threshold designs, 432 cyclosporine, oral lichen planus, 28, 55, 59–60, 65–6, 71, 141, 158, 160, 171, 236, 244, 247 cystic fibrosis, azithromycin, 295, 299–301, 304–5 DAMOCLES Study Group, on DSMBs, 211, 214 Data and Safety Monitoring Board (DSMB), 133, 209, 210, 211–4, 216 closing trial for futility, 225–6 independence, 223 reporting about, 240 databases, 131 freezing, 144 data collection, 77–102 finishing, 143–4 initiation, 130–2 specification in protocols, 62 data entry, forms, 125–6 data monitoring committees, 66 dates, recording, 95 decimal places, 92, 95, 154, 354 Declaration of Helsinki, 67 delays, 37, 38, 39, 134

after data collection, 144 after randomisation, 242–3 dental studies, split-mouth designs, 10, 293, 312–7, 367, 368 depression exercise for, 377 major depressive disorder, 438–9, 444 descriptive data, 77 variables, 78 design effects (DE) change from baseline, 338–9, 344, 346, 346–7 cluster trials, 381–2 comparing, 343–4 repeated measures design, 338–44, 347 stepped wedge designs, 391, 405–7, 412 see also analysis of covariance diabetes cluster trial, 376 see also type 2 diabetes dichotomisation, continuous numerical data, 431–2 diclofenac, 280–1 difference of differences, 323, 327 difference of ratios, 323, 327 digit preference, 92 dilution, 37–8 intention-to-treat and, 40 dimeticone, tungiasis, 448 discrete numerical data, 79 disease-free survival, 89 docetaxel, non-small-cell lung cancer, 269 documentation, 39 randomisation, 119 see also forms; protocols; publication dosage, 35 warfarin, 362 dose–response designs, 6, 7 multiple doses, 275–9 dot plots, 257 double-blind RCT, 15–7, 117–8 code breaking, 118, 128–9 double consent design (Zelen), 473, 474 double triangular sequential design, 453–4, 455 dressings, blinding and, 91 drop-out, see withdrawals drug reactions, 73–5 dummy variables, 177, 266, 267 duplicate registrations, 127, 128 dynamic allocation, 111–3 by minimisation, 111–3, 115, 241 dyslipidaemia, 255–7, 306, 383–4 early learning composite score, iron deficiency, 447 early reviews, 214–8 early stopping, 53, 210, 214, 217, 219

for futility, 226–8 economic evaluation, 90 eConsent, 69 effect size, 33, 35, 43, 147, 198–200, 461 dose–response designs, 276 multiple interventions, 270 pilot studies for, 439, 441–2 reporting, 233, 243–4, 246 very small trials, 471–2 electronic informed consent forms, 69 elicited prior distribution, clinical opinion as, 466, 467 eligibility, 31–2, 40 biomarkers for, 417 in protocols, 56–8 recording on forms, 95 run-in periods, 307 ELISA kits, 88–9 emergency procedures, 128–9 endotracheal tubes, laryngeal mask airways vs, 221–2 endpoints, 80 assessments, 238 forms, 139 burns, 83 composite, 85 delay, 38 intermediate, 89 measurement, 80–90 non-inferiority trials, 361–2 reporting, 252–4 single-measure, 85–6 surrogate, 89, 451 survival-time, 429–30 variables, 78, 82 baseline, 92–3 stepped wedge designs, 392 see also binary endpoints; continuous endpoints; multiple endpoints enhanced care blood pressure, 382–3 diabetes, 376 enrichment designs, 417, 420–2 hypotheses, 419–20 null hypothesis, 419–20 ENSG5 protocol (neuroblastoma), 55, 56 epidermal growth factor receptor, 418, 419, 431–2 breast cancer, 7, 113 equipoise, 33 equivalence trials, 357, 370–4 erlotinib, non-small-cell lung cancer, 433 errors in data capture, 131–2, 139–40 in measurements, 101

rounding error, 354 Type I errors, 188, 198, 201 Type II errors, 187–8, 198 estimated difference, 150, 188, 189 estimated mean, 152 estrone levels, 88–9 ethics, 24 cluster trials, 376–7 consent, 32–3, 239 cross-over trials, 308 discussion in protocols, 67–70 2 × 2 factorial design, 37 Zelen randomised consent designs, 472–3 European Organization for the Research and Treatment of Cancer (EORTC) guideline, 102 trials, 91 European Union, trials registry, 134 event-free survival, as endpoint variable, 82 evidence-based medicine (EBM), 22 exchangeable autocorrelation, 332–3, 338–44, 348 exclusion, selective, 40–1, 57, 248 exenatide, 238 exercise depression in care homes, 377 fuel metabolism, 309 EXPEL trial (2016), 66, 470–1 protocol, 60 external evidence, Bayesian statistics, 462, 464 external-pilot studies, 435, 437–44 eyelid surgery, anaesthesia, 293–4, 295, 296, 298–9, 303, 309–10, 311 eyes, as matched organs, 355 facemasks, 377, 384–5 factorial designs, 265, 279–88 2 × 2 factorial design, 36, 37, 279–81, 282–8 falciparum malaria, 7 false errors, 140 false-negative errors (Type II errors), 187–8, 198 false-positive errors (Type I errors), 188, 198, 201 feasibility studies, 12, 435–7 feedback, 137–41 femoropopliteal peripheral artery disease, 363–4 ferric carboxymaltose, 320, 323–4, 326, 333–4, 335, 336–8, 339, 340, 353, 354 fever, antipyretics, 39 Fisher F-test, 303 fitted linear regression equation, 276 fixed-effect regression model, 329, 330 cluster trials, 380 stepped wedge designs, 400 flat rule-of-thumb pilot studies, 439–40 fluticasone, 284, 285–6

follow-on trials, 145–6 follow-up forms, 95–6 forest plots, 478–9, 480 forms, 125–6 case records, 130–1 data recording, 63, 93–8 inclusion in protocols, 72 late or missing, 138–9 layout, 93–4, 125 role of trials office, 124 timetables for receipt, 130 training in completion, 129 validation, 139–40 see also under consent free breakfast (schools), 409, 410, 413–4 fuel metabolism, exercise, 309 Fuji IX Gp, ChemFlex (dental) vs, 316 full design, stepped wedge designs, 397 fundamental equation, 189, 190 funding, 234–5 Furlow palate repair technique, 281 futility, interim analysis finding, 224–8, 469, 470–1 gastrectomy, peritoneal lavage, 60, 66, 216, 470–1 gastric cancer, 197–8, 470–1 gastric emptying time, 4–5, 206 gastro-oesophageal reflux, 105 gemcitabine, non-small-cell lung cancer, 269 general anaesthesia, children, 221–2 generalisability, 31 genomic targets, 417–34 genotype-guided dosing, warfarin, 362 geographical trial centres, 70, 82 geometric means, 351 glass-ionomer cement, resin-based fissure sealant vs, 314–5, 317 glaucoma, 9, 47, 61, 63, 68, 69, 80 glioblastoma, 18 glucose levels, exercise, 309 gold standard, 192 Good Clinical Practice, 24, 67 grading, 91–2 graphics, 257–8 repeated measures design, 351–2, 352, 353 group names, on published papers, 71 group-sequential designs, 454 guidelines, 43–4 cross-over trials, 317 health-related quality of life, 102 non-inferiority trials, 373–4 in protocols, 75 on publication, 230–1, 260–2 response measures, 83

Guillain–Barré syndrome, 246 gynaecological cancer, sexual function, 98–9 haloperidol, 34, 239, 246, 270, 273–4 hand surgery, pain prevention, 6, 117–8, 275–9 Haybittle stopping rule, 221–2 hazard ratios (HR), 168, 196–7, 464–5 head louse infestation, 372 healing, 38, 79, 143 health-related quality of life, 87, 90 guideline, 102 SF-36 instrument, 100 heart failure, 320, 323–4, 326, 333–4, 335, 336–8, 339, 340, 353, 354, 462, 463 HEART score, 408–9 heat-labile toxin patches, travellers’ diarrhoea, 257, 258 hepatitis B vaccines, 5 hepatocellular carcinoma, 34, 52, 224, 240–1, 246, 259, 467 HER2-positive breast cancer, 7, 113 heroin dependence, 220, 271–2 heterogeneity, 480–1 hierarchy of design, 15–20 high dose rate brachytherapy, 437 Hill, Austin Bradford (1897–1991), 20–1 hip protectors, 10–1, 29, 35, 379 historical aspects, 20–2 historical comparisons, 17 HIV infection, 348 homeopathy, 29–30 hospice care, music therapy, 441, 443 hot-line systems, communications, 128–9 hypercholesterolaemia, 383–4 hypertension, 8–9, 383–4 hypotheses enrichment designs, 419–20 equivalence trials, 370 non-inferiority trials, 358, 359 unstructured designs, 266 see also alternative hypothesis; null hypothesis hysterectomy, vaginal cuff infections, 475 ibuprofen, 39 ICH E6 (R2) (2016), 75 identifiers of subjects, 335–7 of trials, 127 image-guided radiotherapy, 437 implants (dental), 10 incomplete designs, stepped wedge designs, 398 inflation noncentral t-distribution, 443–4

upper confidence limits, 443 information, for consent process, 33, 69 information fraction, 226 informative missing assessments, 90 initiation of trials, 121–35, 185–6 insulin, 6, 289–90, 291, 372 intellectual disability, 245–6, 273–4 consent process, 239 intention-to-treat (ITT), 40–1, 371, 373 Zelen randomised consent designs, 473 interactions, in factorial trials, 280–1, 285, 287, 288 intercepts, 42 interim analyses, 180–2, 211, 214, 219–28, 467–72 early reviews, 214–8 examples, 215, 216, 217 in protocols, 66 stopping rules for, 246 interim results, publication, 230 interleukin-6 receptor inhibition, 242, 275, 277 intermediate endpoints, 89 internal data monitoring, 132–41, 211, 246 internal information, Bayesian statistics, 464 internal-pilot studies, 205–6, 435, 445–7 international trials registries, 134 interventions best, 268–9 complex, 414 describing in papers, 237 describing in protocols, 53–6 more than two, 265–94 see also yes-intervention stepped wedge designs intra-cluster correlation (ρ), 380–1, 403–7 intraocular lenses, 479–80 intraocular pressure, 80 introduction, see background iodine-131-lipiodol, 467 iron deficiency non-anaemic, 447 see also ferric carboxymaltose ISA247 (calcineurin inhibitor), 278–9 jittering devices, graphs, 351, 352 journals choice of, 145 protocols described in, 141 Royal Statistical Society, 22 Kansas City Cardiomyopathy Questionnaire (KCCQ), 350, 351, 353 Kaplan–Meier (K–M) survival curve, 163–7 KRAS/BRAF biomarker, 433 KRAS mutations, 419

laboratory measures, 87–9 language (wording), in papers, 233 laparoscopic colectomy, trial form, 96 laquinimod, 87 large dose–response design, 7 large simple trials, 461–2 laryngeal mask airways, endotracheal tubes vs, 221–2 late forms, 130 launch of trial, 134, 137 lay members, approval committees, 126 layout forms, 93–4, 125 papers, 145 questionnaires, 98–9 lidocaine, eyelid surgery, 294, 296, 298–9, 303, 309–10, 311 likelihood distribution, 463, 464–5 very small trials, 472 Likert scale, 100 limbal relaxing incisions, 479–80 limited resources, 201–2 Lind, James (1716–1794), 21 linear profiles, repeated measures design, 335–8 linear regression equation, 42, 276, 463 lists of patients, 116–7, 119 literature reviews, 466 for protocols, 49 literature searches, 477 lithium, 456, 457 LivRelief cream, 438 logistic regression, 171, 267, 402–3 logit (p), 171, 173 log-rank test, 166, 168, 179 longitudinal designs (repeated measures), 8, 86–7, 319–56 low back pain, 236–7, 280–1 lower limits (LL), in non-inferiority trials, 359, 361, 363 magnesium stents, coronary arteries, 11, 29, 30, 201 main effects, factorial trials, 279, 280–1, 285–6 major depressive disorder, 438–9, 444 malaria, 7, 159, 160, 170–1 management of patients data on, 77 on forms, 72 manipulative therapy, 280–1 Mann–Whitney U test, 156 masking, see blinding masks (facemasks), 377, 384–5 matched organs, 354–5

matched-pair trials, 293–305, 362, 364–6, 367, 369–70 matching, randomisation vs, 16 matrices, stepped wedge designs, 396, 397, 407, 413 mattresses, 452–3, 454–5 McNemar test, 300, 301, 314, 315 mean baseline – mean post-randomisation assessments, repeated measures design, 325 means, 149, 152, 188 comparing, 150, 169–70, 226 cluster trials, 379–80 factorial designs, 285 matched-pair trials, 295–9 non-inferiority trials, 362 trial size, 190–1 population means, 43, 152 repeated measures design, 351 weighted, 464 measurement, 77–102 endpoints, 80–90 questionnaires as instruments, 98 repeated, 8, 86–7, 319–56 two-site, 80, 162 see also outcome measures median difference, 155, 156 medians, 155–6 melphalan, myeloma, 81, 191 mesothelioma, 287 meta-analyses, 476–81 metastases nasopharyngeal cancer, 204–5 prostate cancer, 192–3 methicillin-resistant Staphylococcus aureus (MRSA), 361 metronidazole, vaginal cuff infections, 475 minimisation dynamic allocation by, 111–3, 115, 241 malfunction, 259 missing assessments, 90 bias from, 131 repeated measures design, 348–9 see also censored data missing forms, 130, 138–9 missing values, 178, 181, 180 mitomycin, mesothelioma, 287 mixed regression models, 330, 380 modifications, protocols, 73–5, 125 modified overt aggression scale (MOAS), 245–6 monitoring, see internal data monitoring more than one question of interest, 36 more than two interventions, 265–94 multicentre trials, 20 authorship, 71

internal data monitoring, 133 large simple, 461–2 randomisation in, 114 recruitment problems, 138 stratified randomisation, 109–10 very small, 471–2 multiple daily insulin injections, 6, 289–90, 291 multiple endpoints, 87, 180 trial size, 186, 203–4 multiple interventions, 265–94 multiple measures, 86 multiple sclerosis, 87 multiple sites per subject, 80, 355–6 music therapy, hospice care, 441, 443 myeloma, 81, 82, 191, 203 naltrexone, 220, 271–2 nasopharyngeal cancer, chemoradiotherapy, 29, 50, 54, 96–8, 204–5, 217–8, 244, 466 national trials registries, 134 neuroblastoma, 55, 56, 110–1, 166–8, 174, 178 neutral parties, 115 new developments, during trial, 145, 182 newsletters, 141 nitric oxide monitoring, asthma, 241 no-intervention stepped wedge designs, 391, 392, 393 noise (ε), 42, 169, 329 nominal data, 78–9, 91–2 noncentral t-distribution, inflation, 443–4 non-equivalence, 370 non-inferiority margin, 233 non-inferiority trials, 7, 12–3, 357–74 non-linear profiles, repeated measures design, 338 non-randomised comparisons, 120 non-small-cell lung cancer, 38, 269, 433 normal distributions, 148–9 nuisance covariate, 176 null hypothesis, 151 enrichment designs, 419–20 equivalence trials, 370 non-inferiority trials, 358 standard errors if true, 159 survival times, 167 testing, 104 unstructured designs, 266 wrong rejection, 187 number needed to treat, 159–60 number of events (E), 419 numerical data, 79 nurse-led medicines monitoring, care homes, 409–10

objectives, in protocols, 49–52 observers, choice of, 91 odds ratio (OR), 170–2 confidence interval for, 183 matched-pair trials, 300 multiple interventions, 267 trial size, 194–5 office, see trial office ‘off study’ (term), 125 one baseline – one post-randomisation assessment, repeated measures design, 322–4 1-sided alternative hypotheses, 206, 419–20 online systems, randomisation, 242 on-study forms, 138 open cohort stepped wedge designs, 391, 392, 394, 398–408 opening date, 137 open-label trials, 17, 36 open questions, 95, 100 optimism, 199 oral hypoglycaemic agents, 6, 289–90, 291 oral lichen planus, cyclosporine, 28, 55, 59–60, 65–6, 71, 141, 158, 160, 171, 236, 244, 247 oral soft tissue augmentation, 368 oral (verbal) consent, 239–40 ordered categorical data, 79, 161–3, 173, 194–6 outcome measures, 51–2 clusters, 80 out-of-range responses, 139–40 outsourcing, 182 ovarian cancer, 478–9 Oxford program, 22 pain chest, 408–9 chronic non-cancer, 308, 367–8 low back, 236–7, 280–1 partial thickness burns, 143 postoperative, hand surgery, 6, 117–8, 275–9 visual analogue scores, 84, 85, 294 paired designs, 9, 293–317, 361, 362 paper forms, 131 paracetamol, 39, 280–1 parallel designs, 12, 27–44, 52 equivalence trials, 372 repeated measures design, 319–56 three-group design, 36 two-group design, 4–5 partial thickness burns, 32, 143 patients baselines, 253 changing options, 40 see also consent

describing in papers, 236–7, 250–2 lists of, 116–7, 119 as observers, 91 registration, 129–30 selection, 30–2, 103 specification in protocols, 56–8 selective exclusion, 40–1 self-reported outcomes, 90 see also management of patients Pearson correlation coefficient, 331 peer review, of protocols, 46 percentages, see proportions (p) period effect, cross-over trials, 309 peripheral artery disease, 363–4 peritoneal lavage, gastrectomy, 60, 66, 216, 470–1 permutations, blocks, 106–7 per-protocol analyses, 371 per-protocol summaries, 41 PET-CT imaging, prostate cancer, 192–3 Peto, R., stopping rule, 222 pharmaceutical companies, funding by, 235 phase I trials, 23 phase II trials, 23 phase III trials, 12 phase IV trials, 24 phased implementation designs, see stepped wedge designs phenylketonuria, 327–8, 341 phototherapy, atopic eczema, 321, 325, 341, 342 phrasing, in papers, 233 physician discretion, 248 pilot studies, 435–49, 437–44 internal-pilot studies, 205–6, 435, 445–7 pilot trials, 205 placebos, 19–20, 34, 117–8, 248 vs multiple interventions, 270, 271–2, 273–4, 280–1 planning values, 195, 197, 205–6 Pocock, S.J., on randomisation, 15 polypills, 308 population difference, 150, 153 population means, 43, 152 POST equation, 340, 344 posterior distribution, 464–5, 471 post-randomisation assessments repeated measures design, 319, 320, 322, 324, 325, 326, 327, 328, 329, 331 trends, 342 post-treatment values, 36–7 power, 243 complex structure comparisons, 289–90 significance tests, 187–8, 200–1 precision of measurement, 91–2

pre-clinical research, 22–3 predictive classifiers, defined, 418 predictive markers, 417–34 predictive power test, 470 predictive probability test, 469 predictive value, 192 prednisone, myeloma, 81, 191 premature publication of data, 71 pre-randomisation assessments, 92, 322, 330 pre-roll-out periods, see run-in periods presentations, 143 pressure-relieving support surfaces, 50, 53, 54, 86 assessments, 61–2 data collection, 62–3 ethics of trial, 67 patient selection, 57–8 randomisation, 58, 59 see also bed sores preterm infants analgesia, 363, 365–6, 370 weaning from ventilation, 217 prilocaine, eyelid surgery, 294, 296, 298–9, 303, 309–10, 311 primary outcomes, 52, 238 prior distribution, 464–5, 466, 467 updated, 469 very small trials, 472 probability density functions, 148–9 prognostic data, 77 variables, 82, 175–6 stratified randomisation, 111 progression-free survival, 245 proof of concept studies, 448–9 proportional-hazards regression model, 22 proportional odds model, 173 proportions (p), 78, 156–9, 170–2 comparing, 226, 298, 362, 365 confidence intervals, 182–3 of discordant pairs, 304 trial size, 191–4 prostate cancer, 38, 141, 460–1 PET-CT imaging, 192–3 radiotherapy, cranberry juice, 111–2 protocols, 45–76, 124–6 defined, 45 distribution, 128 on DSMB, 216 early procedures, 41 follow-on trials, 146 listed, 76 modifications, 142 non-adherence, 138 publication of, 141

repeated measures design, 349–50 systematic reviews, 477 proxies, consent by, 68 Pseudomonas aeruginosa, 4, 295, 299–301 pseudo-randomisation, 120 psoriasis, 278–9, 280 psychiatric crisis, 475 psychotherapy, major depressive disorder, 438–9 publication (reporting), 41, 229–62 cluster trials, 388–9 dose–response designs, 279 factorial designs, 287–8 of findings, 248–58 non-inferiority trials, 373 policy, 71 preparing for, 142–5, 229 of protocols, 141 repeated measures design, 350–4 timing, 229–30 publication bias, 477 public health campaigns, 375 publicity, 141 p-values, 151, 159 complex structure comparisons, 290 interim analyses on, 219, 222, 223 trials with multiple endpoints, 180, 203 qualitative data, 78–9 quality of life, 87, 90 guideline, 102 music therapy, 441, 443 SF-36 instrument, 100 quantitative data, 79 questionnaires, 98–101 question of interest, 4, 27, 29–30, 45 arising from trial, 145–6 literature reviews for, 49 more than one, 36 refinement, 122 random effects, repeated measures trials, 330–1 randomisation, 12, 14–5, 43, 103–20 adaptive designs, 457–9 allocation ratio, 36 application, 113–5 blocks, 106–11, 114, 115, 282 cluster trials, 378–9 cross-over trials, 308 describing in papers, 240–3 describing in protocols, 58–61 dynamic allocation by, 111–3 evidence strength, 15–6 external-pilot studies, 438

factorial trials, 282–3 historical aspects, 20–2 malfunction, 259 multiple interventions, 271 process, 115–8 rationale, 104 split-mouth designs, 312 stepped wedge designs, 392, 398 stratified, 109–11, 115, 178, 179, 378 see also biomarker-stratified designs randomised block designs, 107–8 randomised consent designs (Zelen), 472–6 random numbers, 104–5 range checks, 132, 139–40 rank sum test, 155, 161, 162 rare diseases, 29, 471–2 rating, literature, 477 ratios of differences, repeated measures design, 323, 327 recall bias, 101 recommended method (Newcombe and Altman), 158 recording of data, 63, 93–101 see also forms; publication recruitment, 19, 103 cluster trials, 388 large simple trials, 462 patient selection, 30–2 specification in protocols, 56–8 progress of, 137–8 review by DSMB, 218 split-mouth designs, 312–3 see also randomisation reduced designs, stepped wedge designs, 398 refusal of therapy, 37–8, 39, 40, 68, 125, 142, 388 registration of patients, 129–30 registration of trials, 126–8, 234, 477 registries, 134–5 regression coefficients, 42, 286, 463 repeated measures design, 327 standard errors, 335–7 trial size, 186 regression lines, 36–7 regression methods, 267 time-to-event, 173 regression models, 169–78 cluster trials, 379–80 Cox proportional hazards, 173–4, 175–6, 179 factorial trials, 286 proportional-hazards, 22 repeated measures design, 325, 329–31 stepped wedge designs, 400–4 regress SBP Intervention (command), 387

regulatory requirements, 24–5 consent, 33 protocols and, 73–5 relevance of data, 78 renal cell carcinoma, 245, 422, 427, 431–2 repagination, protocols, 75 repeated measures, 8, 86–7, 319–56 repeat forms (follow-up forms), 95–6 reporting, see publication reporting (routine reporting), internal data monitoring, 133–41 resin-based fissure sealant, glass-ionomer cement vs, 314–5, 317 response bias, 100–1 response rate, 82, 244 retrospective checking, 40 rheumatoid arthritis, 275, 277, 369 risk of bias, 480–1 risk ratio, 479 risperidone, 239, 246, 273–4 rituximab, Sjögren’s syndrome, 344–5, 346–7 robust methods, for standard errors, 335–7 Roche, funding by, 235 root caries repair, 119 rounding error, 354 routine reporting, internal data monitoring, 133–41 Royal Statistical Society, journal, 22 run-in periods, 92, 307 stepped wedge designs, 392, 395 safety endpoint, 363–4 safety monitoring, 209–28 see also adverse events; Data and Safety Monitoring Board salary-slip format, 116 salmeterol, 284, 285–6 sample means, 43 samples, 149 size, see size of trial sapropterin dihydrochloride, 327–8, 341 SASSAD (six-area six-sign atopic dermatitis) score, 13, 14, 36, 150, 152, 153, 154, 156, 157, 169–70, 204, 254, 257, 321 schizophrenia, 34, 270 sealed envelopes, 60, 116 searches of literature, 477 selection bias, 103–4, 114, 388 selection design, 267–9 selective exclusion of patients, 40–1 self-reported outcomes, 90 sensitivity, 192–3 sequential designs, sample size, 452–6 serial-correlation, see autocorrelation

serious adverse events (SAV), see adverse events sexual function, gynaecological cancer, 98–9 SF-36 quality of life instrument, 100 significance tests, 150–1, 155, 186–7, 200–1, 221–2 adaptive threshold designs, 432–3 of baseline imbalances, 252, 253 complex structure comparisons, 290 power, 187–8, 200–1 Type I and II errors vs, 188 single-arm external-pilot studies, 437–8 single-arm trials, 11 very small, 472 single-blind design, 17 single consent design (Zelen), 473, 475 single measure endpoints, 85–6 single triangular sequential designs, 456, 457 sites of disease, multiple, 86, 87, 355–6 measurement, 80 six-area six-sign atopic dermatitis (SASSAD) score, 13, 14, 36, 150, 152, 153, 154, 156, 157, 169–70, 204, 254, 257, 321 6-minute walk test, 320, 323–4, 326, 333–4, 335, 336–8, 339, 340, 354 size of trial, 63–4, 185–207, 419–20 adaptive approaches, 452–61 adaptive threshold designs, 432–3 biomarker-stratified designs, 427–30 cluster trials, 381–6, 388 complex structure comparisons, 289–90 cross-over trials, 310–1 on early decisions, 214 enrichment designs, 421–2 factorial designs, 283–4 justifying, 243–6 large simple trials, 461–2 matched-pair trials, 302–5 multiple interventions, 269, 270–1, 275–7 non-inferiority trials, 366–70 pilot studies for, 437, 439–40, 443–5, 446 repeated measures design, 344–7 revising, 205–6, 217–8, 219 small trials, 471–2 split-mouth designs, 315–7 stepped wedge designs, 391, 408, 412–4 too small, 151 Sjögren’s syndrome, 344–5, 346–7 small dose–response design, 6 small parallel two-group design, 4–5 small trials, 33, 185, 186, 471–2 social desirability bias, 101 social media, for consent process, 69 software, 148, 154–5 randomisation, 242

for sample size calculation, 207 sorafenib, non-small-cell lung cancer, 433 specificity, 192 spike plots, 294 SPIRIT 2013 statement, 75 split-mouth designs, 10, 293, 312–7, 367, 368 SQCP01 trial (2006), protocol, 48 SQGL02 trial (1999), 63 protocol, 47, 68, 69 SQNP01 trial (1997), 70, 466 protocol, 50, 54 SQOLP01 trial (1999) protocol, 55, 61, 65–6, 71 randomisation, 59 standard deviations (SD), 42, 152, 153 pilot studies for, 442, 446 re-estimation, 205–6 within-subject, 302 between-subject, 302 standard errors, 149, 153, 188, 463 combining trial results, 478 of the difference, 152 dose–response designs, 276, 277 if null hypothesis true, 159 non-inferiority trials, 362 regression coefficients, 335–7 standardised effect size, 199 standard normal distributions, 148–9 Stata (software package), 148 statistical models, 42 statisticians, 124 as members of DSMB, 213 statistics in papers, 243–8, 262 specification in protocols, 63–6 see also Bayesian statistics steering committees (TSC), 122, 209 relationship with DSMBs, 212, 213 stents, coronary arteries, 11, 29, 30, 201 stepped rule-of-thumb pilot studies, 440–1 stepped wedge designs, 391–413 notation, 392–4 steroids, oral lichen planus, 55 STITCH2 trial, 383–4, 385–6 stopping rules, 214, 221–3, 246, 455, 467, 469 strata, enrichment designs, 420 stratified randomisation, 109–11, 115, 178, 179, 378 see also biomarker-stratified designs stroke trials, 68, 227, 394, 395, 397, 399–400, 401, 402–3, 404, 405 Student t-test, 153, 362, 379 see also t-distribution summaries

per-protocol, 41 in protocols, 48 superiority trials, 12, 357, 358, 373 supplies, 128 surrogate endpoints, 89, 451 survival, 79 disease-free, 89 event-free, as endpoint variable, 82 Kaplan–Meier (K–M) survival curve, 163–7 survival curves, 163–7 survival time endpoints, 429–30 survival times censored data, 163, 165–6 null hypothesis, 167 systematic error, 101 systematic reviews, 126–7, 232–3, 451, 463, 476–81 Cochrane Collaboration, 22, 476 T+ (censored data), 80 tables dose–response designs, 279 repeated measures design, 350–1 tamoxifen, hepatocellular carcinoma, 52 t-distribution, 443–4 teams, 25, 121–2 communications networks, 128 names on published paper, 71, 233–4 personnel changes, 138 protocol preparation, 46 running trial office, 123–4 specification in protocols, 70 writing committees, 143 telephone-based randomisation, 59–60 termination, see early stopping test kits, laboratory measures, 87 test treatments, 35 before-and-after design, 17 thalidomide, myeloma, 81, 191 three-group design, 36 unstructured, 6 three-treatment-three-period cross-over trial, 308 thrombolysis, 394, 395, 399–400, 401, 402–3, 404, 405 tied observations, 155 time scale, 19 time-to-event, 79–80, 163–4 example, 464–5 graphics, 257 regression methods, 173 single measure endpoints, 85, 86 trial size, 196–8, 462 time-varying covariates, 177

timing of assessments, repeated measures design, 347–8, 350 timing of publication, 229–30 tocilizumab, 275, 277 toric intraocular lenses, 479–80 total variance, stepped wedge designs, 412 toxicity, acceptable increase, 359 training, 129 transdermal fentanyl, 367–8 trastuzumab, HER2-positive breast cancer, 113 travellers’ diarrhoea, heat-labile toxin patches, 257, 258 treatment effect, repeated measures design, 354 trends post-intervention, 342 in repeated measures trials, 329 trial office, 70, 114, 115, 122–4 internal data monitoring, 132–3 management of forms, 130 trial steering committee, see steering committees (TSC) triangular designs double sequential, 453, 455 single sequential, 456, 457 t-test, see Student t-test tumours, measurement, 83 tungiasis, 448–9 two-group designs, 27–44 two-period – two-treatment cross-over trials, 305, 307, 308–10, 364–6 two questions, 36 two-site measurements, 80, 162 two sites, 355 2 × 2 factorial design, 36, 37, 279–81, 282–8 Type I errors, 188, 198, 201 type 2 diabetes cross-over trial, 372 exenatide, 238 intensive therapies, 6, 289–90, 291 Type II errors, 187–8, 198 tyrosine kinase 3 mutation, 423 UKW3 trial (1992), protocol, 56 unanticipated events data on, 77 drug reactions, 73–5 uniform correlation (exchangeable autocorrelation), 332–3, 338–44, 348 unordered categorical data (nominal data), 78–9, 91–2 unpublished trials, 126 unstructured autocorrelation, 333


unstructured designs, 265–9, 272 three-group design, 6 updated priors, 469 very small trials, 472 updates, 137–41 upper confidence limits, 442, 443 upper limits (UL), in non-inferiority trials, 359–60, 363

vaginal cuff infections, hysterectomy, 475 validation, 132 data in forms, 139–40 questionnaires, 99 vandetanib, non-small-cell lung cancer, 433 variability assessments in repeated measures design, 351 biological, 13–4 variable block size, 108 variables, 78, 80–2 for databases, 131 dummy, 176–7, 266, 267 endpoints, 78, 82 baseline, 92–3 stepped wedge designs, 392 prognostic data, 82, 175–6 stratified randomisation, 111 recording on forms, 95 stratification, 111 variance, 152, 299 analysis of (ANOVA), 273, 299, 303 contrast (C), 342–3 variance-covariance estimate, 335–6 varicose veins, 438 vascular endothelial growth factor, breast cancer, 426 VAS scores (visual analogue scores), pain, 84, 85, 294 velopharyngeal function, 288

ventilator weaning, 243, 252–4, 257, 258 verbal consent, 239–40 very small trials, 471–2 video consultation, for consent process, 69 vinblastine, mesothelioma, 287 vinorelbine, non-small-cell lung cancer, 269 visual analogue scores, pain, 84, 85, 294 vulnerability of subjects, 31

warfarin, genotype-guided dosing, 362 washout, cross-over trials, 306, 308 web-based randomisation, 241–2 web-based tables, 251 weighted data, real data vs, 465 weighted mean, 464 WHO, trials registry, 134 Wilcoxon Rank-Sum test, 156 Wilms’ tumour, 56, 215 withdrawals (of consent), 33, 37–8 withdrawals (of subjects), 32 cross-over trials, 307 phrasing in protocol, 125 repeated measures design, 348–9 timing of intervention, 283 on trial size calculation, 64, 198, 202–3 unacceptable interventions, 142, 248–9 within-subject correlation coefficients, 313, 355 within-subject standard deviations, 302 wording, in papers, 233 wound healing, 38, 79, 143 writing committees, 143 yes-intervention stepped wedge designs, 391–2, 393, 396 Zelen randomised consent designs, 472–6 Z-scores, 87, 88 z-test, 168, 188, 221