Design and Analysis of Pragmatic Trials 9780367627355, 9780367647360, 9781003126010

This book begins with an introduction of pragmatic cluster randomized trials (PCTs) and reviews various pragmatic issues

223 60 7MB

English Pages 214 [215] Year 2023

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Cover
Half Title
Series Page
Title Page
Copyright Page
Table of Contents
Author Biographies
Chapter 1 Pragmatic Randomized Trials
1.1 Introduction
1.1.1 Statistical Issues in Pragmatic Randomization Trials
1.2 Cluster Randomized Designs
1.2.1 Completely Randomized Cluster Trial Design
1.2.2 Restricted Randomized Designs: Strategies to Improve Efficiency for CRTs
1.2.2.1 Matched-Pair Cluster Randomized Design
1.2.2.2 Stratified Cluster Randomized Design
1.2.2.3 Covariate-Constrained Randomized Design
1.2.3 Multiple-Period Cluster Randomized Designs
1.2.3.1 Longitudinal Cluster Randomized Design
1.2.3.2 Crossover Cluster Randomized Design
1.2.3.3 Stepped-Wedge Cluster Randomized Design
References
Chapter 2 Cluster Randomized Trials
2.1 Introduction
2.2 Continuous Outcomes
2.2.1 Standard Two-Sample t-Test
2.2.2 Adjusted Two-Sample t-Test
2.2.3 Generalized Estimating Equation Method
2.2.4 Mixed-Effects Linear Regression Models
2.3 Binary Outcomes
2.3.1 Standard Pearson Chi-Square Test
2.3.2 Adjusted Chi-Square Test
2.3.3 Ratio Estimator Chi-Square Test
2.3.4 Generalized Estimating Equation Approach
2.3.5 Generalized Linear Mixed Model Approach
2.4 Count Outcomes
2.4.1 Adjusted Normality Test
2.4.2 Ratio Estimator Method
2.4.3 Generalized Estimating Equation
2.5 Cluster Size Determination for a Fixed Number of Clusters
References
Chapter 3 Matched-Pair Cluster Randomized Design for Pragmatic Studies
3.1 Introduction of Matched-Pair Cluster Randomized Design
3.2 Considerations for Pragmatic Matched-Pair CRTs
3.2.1 Impact of Correlation
3.2.2 Impact of Missing Data
3.3 Matched-Pair Cluster Randomized Design with Missing Continuous Outcomes
3.3.1 Sample Size Estimation Based on GEE Approach
3.3.2 Relative Efficiency of GEE Approach vs Crude Adjustment
3.3.3 Adjustment of Inflated Type I Error
3.3.4 Sensitivity Analysis
3.3.5 Example
3.4 Matched-Pair Cluster Randomized Design with Missing Binary Outcomes
3.4.1 Sample Size Estimation Based on GEE Approach
3.4.2 Relative Efficiency of GEE Approach vs Crude Adjustment
3.4.3 Adjustment of Inflated Type I Error and Sensitivity Analysis
3.4.4 Example
3.5 Further Readings
Appendix
A.1 Eigenvalues of the Correlation Matrix
A.2 Proof of Theorem 1
References
Chapter 4 Stratified Cluster Randomized Design for Pragmatic Studies
4.1 Introduction of Stratified Cluster Randomized Design
4.2 Considerations for Pragmatic Stratified CRTs
4.3 Stratified Cluster Randomized Design with Continuous Outcomes
4.3.1 Sample Size Estimation Based on GEE Approach
4.3.2 Relative Sample Size Change Due to Varying Cluster Size
4.3.3 Example
4.4 Stratified Cluster Randomized Design with Binary Outcomes
4.4.1 Sample Size Estimation Based on CMH Statistic
4.4.2 Relative Sample Size Change Due to Varying Cluster Size
4.4.3 Estimation of Clustering Parameter
4.4.4 Example
4.5 Further Readings
References
Chapter 5 The GEE Approach for Stepped-Wedge Trial Design
5.1 Introduction
5.2 A Brief Review of GEE
5.3 Design SW trials with a Continuous Outcome
5.3.1 Accounting for Missing Data
5.3.2 Simulation Research
5.3.3 Adjusting for Underestimated Variances for Small Sample Sizes
5.3.4 Consideration of Efficiency and Robustness
5.4 Design SW Trials with a Binary Outcome
5.4.1 Extension to Outcomes from the Exponential Family
5.5 Longitudinal and Crossover Cluster Randomized Trials
5.5.1 Longitudinal cluster randomized trials
5.5.2 Crossover Cluster Randomized Trials
5.5.3 Comparison of and
5.5.4 Adjusting for Small Numbers of Clusters by the -Distribution
5.5.5 Accounting for Randomly Varying Cluster Sizes
Appendix A: Derivation of Equation (5.8)
Appendix B: Derivation of Equations (5.9) and (5.10)
Appendix C: Sample Size for Cross-Sectional SW Trials
Appendix D: Derivation of Equation (5.15)
Appendix E: Proof of Theorem 1
Appendix F: Proof of Theorem 2
References
Chapter 6 The Mixed-Effect Model Approach and Adaptive Strategies for Stepped-Wedge Trial Design
6.1 A Brief Review of Mixed-Effect Models
6.2 Sample Size Calculation Based on Cluster-Step Means
6.2.1 Commonly Used Correlation Structures
6.2.1.1 Exchangeable Correlation Structure
6.2.1.2 Nested Exchangeable Correlation Structure
6.2.1.3 Block Exchangeable Correlation Structure
6.2.1.4 Exponential Decay Correlation Structure
6.2.1.5 Proportional Decay Correlation Structure
6.3 Adaptive Strategies for SW Trials
6.3.1 Group Sequential Design for SW Trials
6.3.1.1 Bayesian Adaptive Design for SW Trials
6.4 Further Readings
References
Index
Recommend Papers

Design and Analysis of Pragmatic Trials
 9780367627355, 9780367647360, 9781003126010

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Design and Analysis of Pragmatic Trials This book begins with an introduction of pragmatic cluster randomized trials (PCTs) and reviews various pragmatic issues that need to be addressed by statisticians at the design stage. It discusses the advantages and disadvantages of each type of PCT and provides sample size formulas, sensitivity analyses, and examples for sample size calculation. The generalized estimating equation (GEE) method will be employed to derive sample size formulas for various types of outcomes from the exponential family, including continuous, binary, and count variables. Experimental designs that have been frequently employed in PCTs will be discussed, including cluster randomized design, matched-pair cluster randomized design, stratifed cluster randomized design, stepped-wedge cluster randomized design, longitudinal cluster randomized design, and crossover cluster randomized design. It demonstrates that the GEE approach is fexible to accommodate pragmatic issues such as hierarchical correlation structures, different missing data patterns, and randomly varying cluster sizes. It has been reported that the GEE approach leads to under-estimated variance with limited numbers of clusters. The remedy for this limitation is investigated in the design of PCTs. This book can assist practitioners in the design of PCTs by providing a description of the advantages and disadvantages of various PCTs and sample size formulas that address various pragmatic issues, facilitating the proper implementation of PCTs to improve healthcare. It can also serve as a textbook for biostatistics students at the graduate level to enhance their knowledge or skill in clinical trial design. Key features: 1. Discuss the advantages and disadvantages of each type of PCTs, and provide sample size formulas, sensitivity analyses, and examples. 2. Address an unmet need for guidance books on sample size calculations for PCTs. 3. A wide variety of experimental designs adopted by PCTs are covered. 4. The sample size solutions can be readily implemented due to the accommodation of common pragmatic issues encountered in realworld practice. 5. Useful to both academic and industrial biostatisticians involved in clinical trial design. 6. Can be used as a textbook for graduate students majoring in statistics and biostatistics.

Chapman & Hall/CRC Biostatistics Series Series Editors

Shein-Chung Chow, Duke University School of Medicine, USA Byron Jones, Novartis Pharma AG, Switzerland Jen-pei Liu, National Taiwan University, Taiwan Karl E. Peace, Georgia Southern University, USA Bruce W. Turnbull, Cornell University, USA Recently Published Titles Statistical Methods for Mediation, Confounding and Moderation Analysis Using R and SAS Qingzhao Yu and Bin Li Hybrid Frequentist/Bayesian Power and Bayesian Power in Planning Clinical Trials Andrew P. Grieve Advanced Statistics in Regulatory Critical Clinical Initiatives Edited By Wei Zhang, Fangrong Yan, Feng Chen, Shein-Chung Chow Medical Statistics for Cancer Studies Trevor F. Cox Real World Evidence in a Patient-Centric Digital Era Edited by Kelly H. Zou, Lobna A. Salem, Amrit Ray Data Science, AI, and Machine Learning in Pharma Harry Yang Model-Assisted Bayesian Designs for Dose Finding and Optimization Methods and Applications Ying Yuan, Ruitao Lin and J. Jack Lee Digital Therapeutics: Strategic, Scientifc, Developmental, and Regulatory Aspects Oleksandr Sverdlov, Joris van Dam Quantitative Methods for Precision Medicine Pharmacogenomics in Action Rongling Wu Drug Development for Rare Diseases Edited by Bo Yang, Yang Song and Yijie Zhou Case Studies in Bayesian Methods for Biopharmaceutical CMC Edited by Paul Faya and Tony Pourmohamad Statistical Analytics for Health Data Science with SAS and R Jeffrey Wilson, Ding-Geng Chen and Karl E. Peace

For more information about this series, please visit: https://www.routledge.com/ Chapman–Hall-CRC-Biostatistics-Series/book-series/CHBIOSTATIS

Design and Analysis of Pragmatic Trials

Song Zhang, Chul Ahn, and Hong Zhu

Designed cover image: © Song Zhang, Chul Ahn and Hong Zhu First edition published 2023 by CRC Press 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742 and by CRC Press 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN CRC Press is an imprint of Taylor & Francis Group, LLC © 2023 Song Zhang, Chul Ahn and Hong Zhu Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microflming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, access www.copyright .com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf .co.uk Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identifcation and explanation without intent to infringe. ISBN: 9780367627355 (hbk) ISBN: 9780367647360 (pbk) ISBN: 9781003126010 (ebk) DOI: 10.1201/9781003126010 Typeset in Palatino by Deanta Global Publishing Services, Chennai, India

Contents Author Biographies.............................................................................................. vii 1 Pragmatic Randomized Trials......................................................................1 1.1 Introduction .............................................................................................1 1.2 Cluster Randomized Designs ...............................................................6 References ....................................................................................................... 23 2 Cluster Randomized Trials......................................................................... 31 2.1 Introduction ........................................................................................... 31 2.2 Continuous Outcomes.......................................................................... 32 2.3 Binary Outcomes...................................................................................43 2.4 Count Outcomes.................................................................................... 52 2.5 Cluster Size Determination for a Fixed Number of Clusters ......... 59 References ....................................................................................................... 61 3 Matched-Pair Cluster Randomized Design for Pragmatic Studies ....... 65 3.1 Introduction of Matched-Pair Cluster Randomized Design ..........65 3.2 Considerations for Pragmatic Matched-Pair CRTs........................... 67 3.3 Matched-Pair Cluster Randomized Design with Missing Continuous Outcomes.......................................................................... 68 3.4 Matched-Pair Cluster Randomized Design with Missing Binary Outcomes...................................................................................80 3.5 Further Readings................................................................................... 88 Appendix......................................................................................................... 91 References ....................................................................................................... 93 4 Stratifed Cluster Randomized Design for Pragmatic Studies ........... 97 4.1 Introduction of Stratifed Cluster Randomized Design .................. 97 4.2 Considerations for Pragmatic Stratifed CRTs .................................. 98 4.3 Stratifed Cluster Randomized Design with Continuous Outcomes.............................................................................................. 100 4.4 Stratifed Cluster Randomized Design with Binary Outcomes.........105 4.5 Further Readings................................................................................. 112 References ..................................................................................................... 115 5 The GEE Approach for Stepped-Wedge Trial Design ........................ 119 5.1 Introduction ......................................................................................... 119 5.2 A Brief Review of GEE........................................................................ 122 5.3 Design SW trials with a Continuous Outcome............................... 125 5.4 Design SW Trials with a Binary Outcome....................................... 142 v

vi

Contents

5.5 Longitudinal and Crossover Cluster Randomized Trials............. 148 Appendix A: Derivation of Equation (5.8) ............................................... 161 Appendix B: Derivation of Equations (5.9) and (5.10)............................ 163 Appendix C: Sample Size for Cross-Sectional SW Trials ....................... 165 Appendix D: Derivation of Equation (5.15) ............................................. 166 Appendix E: Proof of Theorem 1 ............................................................... 167 Appendix F: Proof of Theorem 2 ............................................................... 168 References ..................................................................................................... 169 6 The Mixed-Effect Model Approach and Adaptive Strategies for Stepped-Wedge Trial Design ................................................................... 173 6.1 A Brief Review of Mixed-Effect Models .......................................... 173 6.2 Sample Size Calculation Based on Cluster-Step Means ................ 174 6.3 Adaptive Strategies for SW Trials..................................................... 185 6.4 Further Readings................................................................................. 193 References ..................................................................................................... 197 Index ..................................................................................................................... 201

Author Biographies Song Zhang is a professor of biostatistics from the Peter O’Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center. He received his Ph.D. in statistics from the University of Missouri-Columbia. His research interest includes Bayesian hierarchical models with application to disease mapping, missing data imputation, joint modeling of longitudinal and survival outcomes, and genomic pathway analysis, as well as experimental design methods for clinical trials with clustered/longitudinal outcomes, different types of outcome measures, missing data patterns, correlation structures, and fnancial constraints. He has co-authored a book titled “Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research” (Chapman & Hall/CRC). As the principal investigator, Dr. Zhang has received fundings from PCORI, NIH, and NSF to support his research. Chul Ahn, Ph.D., is a professor at the Peter O’Donnell Jr. School of Public Health and the director of the Biostatistics Shared Resource at the Simmons Comprehensive Cancer Center, located at the University of Texas Southwestern Medical Center. He received his Ph.D. in statistics from Carnegie Mellon University and has published over 500 peer-reviewed papers addressing the design and analysis of clinical trials and epidemiological studies, including clustered and longitudinal studies. Additionally, he co-authored a book titled “Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research” (Chapman & Hall/CRC). Hong Zhu, Ph.D., is a Professor of Biostatistics in the Department of Public Health Sciences at the University of Virginia (UVA) School of Medicine and Director of Biostatistics Shared Resource at the UVA Comprehensive Cancer Center. She received her Ph.D. in Biostatistics from the Johns Hopkins University Bloomberg School of Public Health. Dr. Zhu’s methodological research interests include design and analysis of explanatory and pragmatic clinical trials, statistical methods for complex survival data (biased sampling, semi-competing risks, multivariate survival data), and causal inference. As the principal investigator, she has received fundings from PCORI and NSF to support her methodological work. Her collaborative research interests include basic, clinical, translational, and population sciences studies in areas including cancer, chronic and infectious diseases.

vii

1 Pragmatic Randomized Trials

1.1 Introduction Clinical trials have been broadly classifed as pragmatic or explanatory trials. Schwartz and Lellouch [1] probably frst used the terms “explanatory” and “pragmatic” to differentiate trials. The term “explanatory” was used for trials that test whether an intervention works under optimal situations that control heterogeneity as much as possible to isolate the intervention effect. The term “pragmatic” was used for trials that evaluate whether an intervention works in real-life routine practice conditions, which makes results widely applicable and generalizable. Explanatory trials are valuable for understanding questions of effcacy in an ideal patient population. Exploratory clinical trials, especially randomized clinical trials (RCTs), are designed as experiments with the ability to determine cause–effect relationships. They enroll highly selected subjects who may not represent the target population. Exploratory trials follow a strict treatment protocol to evaluate whether an intervention is effcacious and safe [2]. Findings from explanatory trials do not appear to be directly generalizable to a real-world clinical setting with real-world patients. The most distinctive features of pragmatic clinical trials (PCTs) are that they compare clinically relevant alternative interventions and include a broadly diverse population of study participants with few eligibility criteria. These features lead to increased feasibility and broader generalizability to the target population. Typically, the fndings from pragmatic trials have high generalizability. In contrast, explanatory trials are the most likely experiments where any differences between intervention groups resulted from scientifc testing under rigorously controlled conditions. Pragmatic and explanatory trials are not distinct concepts since trials may incorporate differing degrees of pragmatic and explanatory components. A trial can be designed to have some aspects that are more pragmatic than exploratory and vice versa. For example, a trial may have an explanatory element with strict eligibility criteria. Still, it may have a pragmatic aspect of a trial with minimal to no monitoring of practitioner adherence to the study protocol and no formal follow-up visits. Overall, there is a continuum DOI: 10.1201/9781003126010-1

1

2

Design and Analysis of Pragmatic Trials

between pragmatic and explanatory trials, with many trials having both pragmatic and explanatory elements included [3]. Tools have been developed to help researchers design pragmatic trials [3–6]. Gartlehner et al. [5] proposed a simple instrument to distinguish pragmatic from explanatory trials. They used a binary system (yes/no) to evaluate if the trial was pragmatic or exploratory using the seven study design criteria, even though they acknowledged that explanatory and pragmatic components exist in a continuum. Thorpe et al. [6] introduced the PragmaticExplanatory Continuum Indicator Summary (PRECIS) tool, which enabled investigators to design trials with the explanatory-pragmatic continuum in ten domains. They identifed some weaknesses, including unclear face validity and inter-rater reliability, lack of a scoring system, redundancy in some PRECIS domains, and the need for more guidance on using the tool. They developed a validated tool, Pragmatic-Explanatory Continuum Indicator Summary version 2 (PRECIS-2), by addressing the weaknesses while keeping the strengths of the original PRECIS tool [3]. PRECIS-2 is a nine-spoked wheel, with each spoke representing a domain that denotes a critical element of a trial design. Nine domains of PRECIS-2 are eligibility criteria, recruitment, setting, organization, fexibility in delivery, fexibility in adherence, follow-up, primary outcome, and primary analysis. Each element of the nine domains is scored on a fve-point Likert scale (from 1 denoting very explanatory “ideal conditions” to 5 indicating very pragmatic “usual care conditions”) to facilitate domain discussion and consensus. For interactive tools on the PRECIS-2, see https://www.precis-2.org/. We mark each spoke to represent the location on the explanatory (hub) to pragmatic (rim) continuum, connect the dots on a blank wheel graph in PRECIS-2, and link each of them to their immediate neighbor, as in Figure 1.1. Figure 1.1 shows the scores of each item in the PRECIS-2 wheel for a clinical trial to be developed. Predominantly explanatory trials produce spoke and wheel diagrams close to the hub, whereas pragmatic trials produce diagrams close to the rim. A pragmatic trial is “designed to test the intervention in the full spectrum of everyday clinical settings to maximize applicability and generalizability” [7]. An important feature of healthcare PCTs is leveraging the available data, such as the electronic health record (EHR) or insurance claims data, to cut costs and personnel effort associated with traditional exploratory trials. EHR is an appealing tool to identify, recruit, and enroll trial participants, perform randomization, follow participants over time, and collect patientrelevant outcome data because the data does not require participants’ contact or adjudication. Assessing study outcomes via linkage with EHRs decreases the burden on the participants and the research team [8]. EHRs have been used to evaluate the effectiveness of various interventions in some pragmatic trials [9–12]. In addition, pragmatic trials nested within EHRs can help overcome the limitations of explanatory trials, including high cost, time burden, and limited generalizability [13, 14].

Pragmatic Randomized Trials

3

FIGURE 1.1 The PRagmatic-Explanatory Continuum Indicator Summary 2 (PRECIS–2) wheel. Adapted from BMJ2015;350:h2147.

1.1.1 Statistical Issues in Pragmatic Randomization Trials An ideally conducted randomized clinical trial (RCT) ensures that the observed difference in outcomes between control and intervention groups is only attributed to the absence or presence of the intervention. Missing data may seriously compromise the credibility of causal inferences from PCTs. Missing data ubiquitously occur in PCTs that use electronic health records (EHRs) or insurance claims data and may compromise the causal inference if missing data is inappropriately handled. Missing data can happen due to various reasons, such as participants’ missing scheduled visits or dropout from the study. PCTs may have a high proportion of missing values since some providers may drop out of the trial or fail to collect specifc data from all individuals at follow-up visits. The probability that individuals are lost to follow-up or miss their follow-up appointment may depend on both provider- and individual-level characteristics. Therefore, missing data must be appropriately handled using valid statistical approaches in the design and analysis of PCTs to draw valid inferences [15]. An apparent reason to properly deal with missing outcomes is to retain the statistical power of the

4

Design and Analysis of Pragmatic Trials

original randomized clinical trial design. Statistical power is the probability of detecting a meaningful effect, given that such an effect exists. Missing data mechanisms are commonly summarized into three categories [16]. Data are considered missing completely at random (MCAR) if missingness is independent of the observed outcomes and covariates. MCAR is a strong assumption and is not likely in most clinical trials. MCAR indicates whether missing data is entirely unrelated to either observed or unobserved data. A more sensible assumption is missing at random (MAR), where missingness does not depend on unobserved data after conditioning on the observed data. A likelihood-based analysis using only the observed data can be valid under MAR if the analysis adjusts for those variables associated with the probability of not being observed. Data are termed missing not at random (MNAR) if missingness is dependent on unobserved data values even after conditioning on fully observed data. Complete case analysis may be valid if the probability of being a complete case is independent of the outcome given the covariates in the analysis model. However, the complete case analysis will lead to larger estimates of standard errors due to the reduced sample size. Complete case analysis may be unacceptably ineffcient if more than a small proportion of individuals are omitted. Additional approaches include single imputation, multiple imputation, and model-based methods. Single imputation approaches replace each missing value with a single value based on the sample mean, a regression-based prediction, the last observation carried forward (LOCF), or other strategies. In single imputation, the imputed values will be assumed to be the real values, as if no data were missing. Although single imputation approaches can give unbiased estimates, standard errors are generally too small because they do not refect uncertainty about the values of the missing observations. Rubin [17] proposed a procedure called multiple imputation to address this problem. Multiple imputation allows for the uncertainty about the missing data by generating different plausible imputed data sets, analyzing each flled-in data set separately, and combining the analyses into a single summary analysis with simple averaging formulas. In cluster randomized trials (CRTs), multiple imputation approaches need to incorporate the withincluster correlation in the multi-level design to ensure valid inferences since ignoring clustering in the multiple imputation can lead to an increase in type I errors [18]. If clusters are included as a fxed effect in the imputation model, imputations lead to excessive between-cluster variability, especially if cluster sizes are small and intracluster correlation is small [19]. In traditional RCTs, potential participants are identifed, contacted, and enrolled. They opt-in to traditional RCTs by giving informed consent. Potential participants decide whether or not to participate in trials after having received and understood all the research-related information. Barriers to enrollment are often high in traditional RCTs since the trial may require extra clinic visits, procedures, and interviews. As a result, consent

Pragmatic Randomized Trials

5

rates are often low, and only a small portion of the eligible population eventually enrolls due to high barriers in traditional RCTs. Pragmatic clinical trials aim to increase the proportion of the target population who eventually enroll in the trial. Pragmatic trials increase generalizability by conducting randomized controlled trials under routine clinical practice conditions instead of specialized research settings [1]. A robust approach in pragmatic trials is an opt-out consent process, in which the default is participation rather than non-participation. In an opt-out process, the staff informs potential study participants at their clinic visit. In an opt-out model, trial participants are included automatically but can opt out completely or with restrictions by allowing selected data to be used. The opt-out consent process is generally expected to result in a more diverse sample of trial enrollees, including many who would not enroll under the traditional opt-in consent process. Unconsented randomized trials must be carefully restricted to contexts where risks are very low and overseen by experienced institutional review boards (IRBs). Pragmatic trials frequently incorporate an opt-out model for the consent process, while traditional RCTs use opt-in consent. Another feature of healthcare pragmatic clinical trials is a cluster randomized trial (CRT). CRTs are particularly useful for evaluating the effectiveness of interventions at the level of health systems, practices, or hospitals [20]. Many challenges and sources of bias in PCTs are related to CRTs [21, 22]. In CRTs, treatment allocation often cannot be blinded to participants, study staff, or providers. In addition, there may be differences in the types of participants enrolled between treatment groups because the assignment of participants to treatment groups generally cannot be concealed. In pragmatic trials, CRTs are often the most feasible study design to avoid contamination in real-world settings. One example of contamination is that if a physician simultaneously provides care to patients assigned to the intervention and control groups, leakage of intervention information might occur between two groups, resulting in an observed intervention effect that is diluted and biased toward the null. A CRT is more complex to design, requires more study subjects, and needs more complex analysis than an individualized randomized trial (IRT). A well-recognized statistical issue in CRTs is that responses tend to be more similar within clusters than across clusters. This within-cluster similarity is quantifed by the intracluster correlation coeffcient (ICC). There has been an extensive investigation into design methods for CRTs to account for ICC properly. Furthermore, the size of the clusters generally varies among clusters. These features present statistical challenges in designing pragmatic cluster randomized trials [23–25]. Due to practical and fnancial constraints, CRTs typically randomize a limited number of clusters. When there are a small number of clusters with diverse settings and populations, simple randomization can result in study groups that differ substantially on critical clinical variables or potential confounders. Unbalanced intervention groups constitute a potentially severe

6

Design and Analysis of Pragmatic Trials

methodological problem for CRTs because an imbalance in crucial baseline characteristics between intervention groups diminishes the case for causal inference, which is the fundamental strength of the randomized clinical trial [26]. Therefore, investigators planning a CRT should consider balancing baseline characteristics. PCTs often require more participants to achieve adequate statistical power to detect clinically meaningful differences. Assessment of repeated outcomes on the same individual may be necessary to examine the effect of an intervention over time. In cluster randomized trials, the need for greater statistical effciency may encourage the adoption of multiple-period CRTs such as parallel arm longitudinal, crossover, and stepped wedge CRTs that measure outcomes repeatedly in the same cluster (and the same individuals) over time. Standard statistical methods for the analysis of IRTs may not be appropriate. Statistical analysis must incorporate the design features of pragmatic CRTs, such as clustering, repeated measurements, and correlations over time. The correlation structure in CRTs is much more complicated than that in IRTs. For example, matched-pair CRTs and multiple-period CRTs must incorporate complex correlation structures. Matched-pair CRTs need to include the following correlations for the design of trials [27, 28]: the within-cluster correlation ( r1 ), the within-matched individual (study unit) correlation ( r2 ), and (3) the within-matched cluster correlation ( r3 ). If matching occurs at the cluster level, the correlation structure would depend only on r1 and r3 with r2 = r3 . Multiple-period CRTs such as longitudinal, crossover, and stepped wedge CRTs need to incorporate within-subject or longitudinal correlation, the conventional intracluster correlation, and crosssubject cross-period correlation [29–34]. The repeated measurements may burden patients regarding quality of life, satisfaction, and costs. Moreover, the proportion of missing data is usually higher due to repeated measures over a longer duration, which cannot be ignored in the design of trials.

1.2 Cluster Randomized Designs A cluster randomized trial (CRT) randomly assigns groups of individuals to each intervention group rather than individuals. In an individually randomized trial (IRT), individuals are randomized to each intervention group, and the intervention is applied at the individual level. A cluster randomized trial has been frequently used in health services research, primary care, and behavioral sciences [23, 35–40]. In designing a CRT, one of the frst decisions to be made is the choice of clusters. For example, in a cancer screening CRT [41], one needs to determine if one will use a physician or a clinic as a cluster. If a physician shares a nurse,

7

Pragmatic Randomized Trials

one may not use a physician as a cluster due to possible contamination. In a CRT, naturally occurring groups or clusters of individuals are the experimental units randomized to each intervention group. The clusters of individuals in CRTs have included families, churches, schools, work sites, clinics, healthcare centers, and communities. CRTs help evaluate interventions when interventions are delivered at the cluster level or when there is a risk that IRTs will contaminate interventions between intervention groups. Cluster randomization is often advocated to minimize treatment contamination between intervention and control participants. For example, in a smoking cessation trial, people in the control group might learn about experimental smoking cessation and adopt it themselves. Contamination of control participants leads to a reduction in the point estimate of an intervention’s effectiveness, which may lead to the rejection of an effective intervention as ineffective because the observed effect size was neither statistically nor clinically signifcant. In addition to minimizing potential “contamination," CRTs can also have practical advantages over IRTs because of ethical considerations, lower implementation costs, ease of obtaining the cooperation of investigators, enhancement of subject compliance, or administrative convenience. The design and conduct of a CRT require special considerations. The lack of independence among individuals in the same cluster, i.e., withincluster correlation, creates unique methodological challenges in design and analysis. That is, observations on subjects from the same cluster are likely more correlated than those from different clusters. The correlation within clusters, which is quantifed by the intracluster correlation coeffcient (ICC), r , may result in substantially reduced statistical effciency relative to trials that randomize the same number of individuals. In CRTs, the total variance of an observation can be divided into two components, s 2 = s B2 + s W2 , where s W2 is the within-cluster variance and s B2 is the between-cluster variance. That is, s W2 is the variance of observations within a cluster and s B2 is the variance of cluster-level means. The degree of similarity or magnitude of clustering is typically measured by ICC ( r ). ICC is defned as the proportion of the total variance in the outcome that is attributable to variance between clusters [23, 35]. The intracluster correlation is

r=

s B2 s B2 = s 2 s B2 + s W2

.

The reduction in effciency is a function of the variance infation due to clustering, also known as the design effect or variance infation factor (VIF), given by VIF = 1 + (m - 1)r , where m denotes the cluster size when cluster sizes are equal across clusters. Inferences are frequently done at the individual level, while randomization is at the cluster level in CRTs. Thus, the unit of randomization may be different from the unit of analysis. In this case, the lack of independence among

8

Design and Analysis of Pragmatic Trials

individuals in the same cluster presents unique methodological challenges that affect both the design and analysis of CRTs. Suppose one ignores the clustering effect and analyzes clustered data using standard statistical methods developed to analyze independent observations. In that case, one may underestimate the true p-value and infate the type I error rate of such tests. Therefore, clustered data should be analyzed using statistical methods that account for the dependence of observations on subjects within the same cluster. If one fails to incorporate the study design’s clustered nature during the study’s planning stage, one will obtain a smaller sample size estimate and statistical power than planned. The analysis of data collected from a CRT must also account for clustering. Analyses at the cluster level (i.e., using summary measures for each cluster) are generally not statistically effcient. Individual-level analysis can account for clustering using generalized estimating equations (GEE) or generalized linear mixed models (GLMM). These modeling techniques also allow for adjusting both cluster-level and individual-level covariates. CRTs are less effcient than IRTs since outcomes of subjects in the same cluster tend to be positively correlated. In addition, it is not always feasible to increase the sample size, even though effciency can be improved by increasing the sample size. Thus, alternative strategies have been proposed, such as reducing residual variance by matching, stratifcation, balancing, including covariates, taking repeated measurements, and using a crossover design. Turner et al. [42, 43] reviewed the design and analysis of methodological developments in group-randomized trials. Textbooks are available for the design and analysis of CRTs [23, 35–40]. 1.2.1 Completely Randomized Cluster Trial Design The completely randomized cluster trial design is analogous to the completely randomized individual trial design, except that clusters are randomized instead of individuals. The completely randomized cluster trial design is the most widely understood CRT design. Clusters are randomly assigned to intervention and control groups using simple randomization. Both the sample size calculation and the subsequent statistical analysis must consider the ICC induced by cluster randomization since the responses of observations within a cluster are usually correlated. Donner et al. [44] provided a sample size estimate for the number of subjects in a completely randomized cluster trial design, which can be obtained by multiplying a design effect (DE) by the required number of subjects calculated under an IRT. Under equal cluster sizes, the design effect can be computed as DE = 1 + (m - 1)r , where r is the ICC and m is the cluster size. Let nind be the number of subjects required for an IRT. Then, the number of subjects needed for a CRT is n = nind × {1 + (m - 1)r } . The design effect, 1 + (m - 1)r , is usually greater than one since r is generally positive. The design effect is also called the variance infation factor (VIF). Let m be the average cluster size, and x represent the

Pragmatic Randomized Trials

9

coeffcient of variation of the cluster sizes. For the unequal cluster sizes, the DE can be approximated by DE = 1 + [(x 2 + 1)m - 1]r , which gives a slight overestimate of the design effect [45]. It was recommended that trials account for variable cluster size in their sample size calculations if x is greater than or equal to 0.23 [45]. The effect of adjustment for variable cluster size on sample size is negligible when x is less than 0.23. For unequal cluster sizes, VIFs were provided for continuous outcomes [24], binary outcomes [25], and count outcomes [46], respectively. Variation in cluster sizes affects the sample size estimate or the power of the study. Guittet et al. [47] and Ahn et al. [48] investigated the impact of an imbalance in cluster size on the power of trials through simulations for continuous and binary outcomes, respectively. The primary analysis approaches are cluster-level analyses and individuallevel analyses such as mixed-effects models and generalized estimating equations (GEEs) [49]. Cluster-level analysis is a two-stage approach with estimating a summary measure of the outcome for each cluster frst and then analyzing the summaries using standard methods for independent data. Variance-weighted cluster-level analysis has been proposed to improve effciency when clusters are not of the same size. Mixed-effects models and GEEs have been frequently used for the analysis of individual-level data. However, these methods can lead to infated type I error rates when the number of clusters is small. For mixed-effect models, small-sample corrections to circumvent the infated type I error rate involve degrees-offreedom corrections [50–52]. For GEEs, the corrections typically involve corrections to standard errors [53, 54]. Example: Colorectal cancer screening is effective but underused. Guidelines for recommended tests and screening intervals depend on specifc risks. To facilitate risk-tailored testing in the primary care context, Skinner et al. [41] developed a tablet-based Cancer Risk Intake System (CRIS). CRIS asks questions about risk before appointments and generates tailored printouts for patients and physicians, summarizing and matching risk factors with guideline-based recommendations. In a pragmatic CRIS trial, physicians were randomly allocated to either a comparison or risk-based CRIS group. In this CRIS trial, a physician corresponds to a cluster. Based on the assignment of his or her physician, each patient was assigned to the CRIS intervention or a comparison group. A CRT was selected over an IRT to avoid contamination and for logistical reasons such as ease of administration.

The investigators plan to conduct a CRIS trial in Hispanic men and women of 40–65 years old. The primary outcome is participation in risk-appropriate colorectal cancer testing (yes/no, 1 = participation in any risk-appropriate testing, 0 = non participation). We anticipate about 20% of the comparison group will participate in appropriate testing based on pilot and published data [55]. We will assume that the CRIS group will yield at least a 32% participation

10

Design and Analysis of Pragmatic Trials

rate. In an IRT ( r = 0 ), we need a sample size of 206 subjects per intervention group using a 5% signifcance level and 80% power. Suppose that each physician recruits 23 patients (m = 23). With r = 0.02 , the design effect, also called the variance infation factor, is DE = 1 + (23 - 1) × r = 1.44, meaning that the CRT requires 1.44 times as many patients as required for an IRT, all else being equal. CRT needs to recruit 297 (= 206 × DE ) patients per intervention group. Since each physician recruits 23 patients per intervention group, the number of physicians (n) to be recruited is 13. Cluster sizes are expected to be unequal because a different number of patients will be recruited among physicians. The required number of physicians can be estimated using the design effect computed under variable cluster sizes for binary outcomes [25]. In some cluster randomization trials such as the GOODNEWS (Genes, Nutrition, Exercise, Wellness, and Spiritual Growth) [56], the number of clusters cannot exceed a specifed maximum value due to practical considerations such as the fxed expense of maintaining staff to collect data and conduct an intervention in each cluster. In those trials, the sample size requirement should be provided for the number of subjects per cluster instead of the number of clusters per group. Donner and Klar [23] and Taljaard et al. [57] provided the sample size formula for the number of subjects required per cluster, assuming the number of subjects in each cluster is fxed in advance. 1.2.2 Restricted Randomized Designs: Strategies to Improve Efficiency for CRTs Cluster randomization trials with relatively few clusters have been widely used to evaluate healthcare strategies. For CRTs with a limited number of available clusters, a practical limitation of the simple completely randomized parallel design is an imbalance in baseline characteristics (covariates) across intervention groups, which would threaten the internal validity of the trials. Randomization does not ensure that the intervention groups will be balanced in crucial baseline characteristics when a limited number of clusters are used in CRTs. Unbalanced intervention groups constitute a potentially severe methodological problem for CRTs because observed differences in outcomes between intervention groups may not be due to the intervention but due to the differences in the underlying study populations [26]. That is, an imbalance in crucial baseline characteristics between intervention groups diminishes the case for causal inference, which is the fundamental strength of the randomized trial. Therefore, investigators planning a cluster randomized trial should consider balancing baseline characteristics. In addition, an imbalance can lead to differences between crude and adjusted estimates of intervention effects, which hinders the interpretability and face validity of the fndings [23, 26]. Various restricted randomization methods have been proposed to minimize imbalance in CRTs. Ivers et al. [26] provided the advantages and limitations of different allocation techniques, including matched-pair,

Pragmatic Randomized Trials

11

stratifcation, minimization, and covariate-constrained randomization, and guided in choosing the best allocation technique for CRTs. 1.2.2.1 Matched-Pair Cluster Randomized Design Matched-pair designs have been frequently used in community intervention trials. For example, Kroeger et al. [58] conducted a matched-pair cluster randomized trial of interventions against Aedes mosquitoes. The primary outcome was the number of containers per house that were positive for immature stages of the mosquito. Pairs of clusters of households were randomized to a package of window curtains and water container covers treated with insecticide, or to control (no intervention). A matched-pair cluster randomized design seems attractive since matching before random assignment of clusters to interventions can signifcantly improve the effciency of causal effect estimation [59]. Matched-pair CRTs are widely conducted by pairing the intervention and control clusters based on selected common features related to the primary outcome to mitigate baseline imbalance and improve statistical power. In a matched-pair cluster randomized design, clusters are formed into pairs based on similar baseline characteristics such as cluster size, geographic location, or other features potentially related to the primary outcome measure. One cluster within each pair is randomly allocated to an intervention and the other to a control. For a matched-pair cluster randomized trial, the matching would be both at an individual level and at a cluster level. Matching reduces between-cluster variability in estimating the intervention effect, reducing the standard error and increasing power and precision. Several authors examined the effect of matching in randomized community intervention trials using t-tests [60, 61]. Suppose that there are n clusters in each intervention group, and the continuous outcome is analyzed by paired and unpaired t-tests for matched and unmatched data, respectively. For unmatched analysis with continuous outcomes, the degrees of freedom will be 2(n - 1) for a t-test statistic that compares the outcome between the two intervention groups. For a matched analysis, with n-matched cluster pairs, the degree of freedom will be n - 1 for a t-test statistic. That is, the matched analysis has half the degrees of freedom of the unmatched analysis. A matched-pair cluster randomized trial has less power than an unmatched cluster randomized trial since a less degree of freedom for a t-distribution corresponds with a larger critical value [23, 37]. Therefore, the gain in correlation within a pair must be larger than the homogeneity within a pair for matching to be worthwhile in a matchedpair CRT [60]. Let rM be the matching correlation, defned as the standard Pearson correlation for the outcome computed over the paired clusters. Matching will yield an effciency gain only if the matching correlation, rM , is suffciently high to offset the reduction in degrees of freedom. For this reason, it is worthwhile to match only cluster-level factors known to be strongly associated with the outcome [23]. Breakeven values of rM are

12

Design and Analysis of Pragmatic Trials

the smallest possible correlations that matching must induce for matchedpair cluster randomization to detect a smaller effect size than the unmatched cluster randomization [37, 59]. The approximate reduction in the design effect for a matched-pair design compared with a completely randomized design is 1 - rM [23, 62]. The design effect will be reduced by 30% if the matching correlation ( rM ) is 0.3. Therefore, an effciency gain will be obtained through matching only if rM is suffciently large to offset the reduction in degrees of freedom. When the number of clusters is large, ineffective matching is not a severe problem. However, matching or not is an important design issue in CRTs when the number of clusters to be randomized is relatively small. Martin et al. [60] compared the statistical power between the matched-pair cluster randomized design and a completely randomized parallel design when the number of matched-pair clusters is small to moderate. They showed that when the number of available pairs is ten or less, then matching should be used only if rM is at least 0.2. For the analysis of continuous outcomes, cluster-level analysis can be done using statistical methods such as paired t-test, Wilcoxon signed-rank test, and permutation test. Let di be the mean difference in the continuous outcome between clusters in the ith matched-pair. A simple approach to test an intervention is applying the standard paired t-test to the mean differences di , i = 1, 2,..., n , where n is the number of matched cluster pairs. The paired t-test assumes that the di is normally distributed with a mean ( D ) and an equal variance across the matched-pairs. Since cluster sizes generally vary across matched-pairs, the assumption of equal variances is violated. When clusters are not the same size, variance-weighted cluster-level analysis has been proposed to improve effciency. When the number of matched-pair is large, violation of the normality assumption is not serious since di will have an approximately normal distribution with the central limit theorem. The Wilcoxon signed-rank test or Fisher’s permutation test can be used as an alternative to the paired t-test in the presence of obvious non-normality or heterogeneity in the variances of the di [23]. These approaches take advantage of avoiding normality assumptions. Wilcoxon signed-rank test is applied to di based on the ranks of the di instead of the actual values. Fisher’s one-sample permutation test uses the magnitude and the ranks of the di . A limitation of the exact permutation test is that we need a minimum of six matched-pairs of clusters to obtain a two-sided p-value of less than 0.05. For an individual-level analysis, generalized estimating equation (GEE) and mixed-effects models can be used to analyze the continuous outcomes from matched-pair CRTs. For the analysis of binary outcomes, Donner and Klar [23] provided statistical methods such as the standard Mantel–Haenszel one degree of freedom chi-square test, Liang’s chi-square test for analyzing 2 × 2 tables [63, 64], paired t-test, Wilcoxon signed-rank test, and one-sample permutation test. The data from a matched-pair CRT with a binary outcome variable may be summarized

Pragmatic Randomized Trials

13

in a series of n 2 × 2 tables, where n is the number of matched cluster pairs with the row for responses (yes/no) and the column for intervention groups (intervention/control). The standard Mantel–Haenszel one degree of freedom chi-square test for analyzing matched-pair clustered data can lead to spurious statistical signifcance since the test ignores the clustering effect. It is recommended not to use a standard Mantel–Haenszel chi-square test to analyze binary outcomes in a matched-pair CRT. Liang [63] developed a chi-square test for the analysis of 2 × 2 tables by relaxing the binomial assumptions. Liang’s chi-square test may apply to CRTs when there are at least 25 pairs. The paired t-test can be applied to the difference di for the matched-pair CRTs, where di is the difference in the proportions of response between intervention and control groups in the ith pair. Individual-level analysis can be conducted using a generalized estimating equation (GEE) or generalized linear mixed model (GLMM). Shipley et al. [65] provided a sample size formula and power of the test that compares the incidence rates between two groups in matched-pair cluster randomized trials. They investigated the factors affecting the power of matched-pair CRTs and provided an example where the power of a CRT is compared to that of an IRT. Donner and Klar [23] provided the sample size formulas that compare the means, proportions, and incidence rates between two groups in matched-pair cluster randomization designs. Rutterford et al. [66] provided the variances for binary and continuous outcomes for matchedpair cluster randomization trials, which can be used for sample size estimation. Missing data is common in matched-pair cluster experimental designs. Investigators often encounter incomplete observations of paired outcomes. For pairs of observations, some subjects contribute complete pairs of outcomes from both intervention groups. In contrast, some subjects contribute only one outcome from either an intervention or control group. In sample size calculation, one common approach to handle such missing data is to use a simple adjustment that divides the sample size estimate under no missing data by the expected follow-up rate. Xu et al. [27, 28] used the GEE method to derive a closed-form sample size formula for matchedpair cluster randomization design with continuous and binary outcomes by treating incomplete observations as missing data in a marginal linear or logistic regression model. The sample size formula was developed using a general correlation structure for the outcome data, including three intraclass correlations: (1) the within-cluster correlation ( r1 ), which describes the similarity between outcomes from different study units within the same cluster under the same treatment; (2) the within-matched individual (study unit) correlation ( r2 ), which describes the similarity between outcomes from two matched individuals under different treatments; (3) the withinmatched cluster correlation ( r3 ), which describes the similarity between outcomes from different study units within two matched clusters but under different treatments. The within-matched individual correlation is required if matching is done at the individual level. However, if matching occurs only

14

Design and Analysis of Pragmatic Trials

at the cluster level, the correlation structure would depend only on r1 and r3 with r2 = r3 . Example: COMMIT (Community Intervention Trial for Smoking Cessation) is a cluster randomized trial employing a matched-pair design [67]. The community was chosen as the unit of randomization in the COMMIT since the interventions were conducted on communities rather than individuals. The investigators selected 22 communities and matched these communities based on several factors expected to be related to the smoking quit rates, such as community size, geographical proximity, age and sex composition, degree of urbanization, and socioeconomic factors. The COMMIT design included 11 matched-pair communities. Each member of the 11 community pairs was randomly assigned to receive active intervention and the other to receive control surveillance. This trial aims to assess the effect of a community program encouraging people to stop smoking. The rates of quitting smoking in the intervention and control communities were compared using a matched-pair analysis, based on the permutation distribution. Gail et al. [67] provided a detailed discussion of permutation tests for the design and analysis of the COMMIT.

1.2.2.2 Stratifed Cluster Randomized Design Matched-pair cluster randomized design can help minimize the differences between intervention groups for prognostic factors potentially related to an outcome and can improve the power and precision of the study. However, there is a possibility that substantial imbalances in other essential factors arise between intervention groups. If there are a small number of categorical covariates, a simple way to handle imbalances is to use a stratifed cluster randomized design. Stratifcation involves grouping clusters into two or more strata that are expected to be similar to the outcome of interest. The clusters within each stratum are then randomly allocated between the intervention groups. In stratifed CRTs, stratifcation factors often include cluster size, clusterlevel socioeconomic status, geographic location, and the healthcare system. Cluster size is commonly used as a stratifying factor because cluster size may be associated with critical cluster-level factors and has been recognized as a surrogate for within-cluster dynamics that predicts outcomes in cluster randomization trials [23]. The conventional randomization procedure might fail to achieve balance in cluster size between intervention groups, leading to biased estimation of an intervention effect. One possible solution is to adopt the stratifed cluster randomization design, where clusters are frst grouped into strata based on size (e.g., small, medium, and large) and then randomized to intervention groups within each stratum. Stratifed CRTs are popular because they increase the chance of the intervention groups being well balanced regarding identifed prognostic factors at baseline and may increase

Pragmatic Randomized Trials

15

statistical power. The stratifed design may be viewed as replicating a completely randomized parallel design in each stratum. The continuous outcomes in a stratifed CRT can be analyzed using a t-test based on the weighted mean difference [23]. The stratum-specifc effects are measured by the difference of the means ( di ) in the ith stratum. The variance of di is computed incorporating an intracluster correlation among subjects within a cluster. The weighted statistic is set up based on the weighted mean difference with weights chosen to refect the relative information in the di [68]. A standard Mantel–Haenszel statistic was extended to analyze binary outcomes, incorporating intracluster correlation ( r ) among observations within a cluster [62]. If r is equal to zero, the generalized Mantel–Haenszel statistic is identical to a standard Mantel–Haenszel statistic. GEE and GLMM can be used to analyze the binary and continuous outcome data by adjusting individual-level as well as cluster-level covariates from a stratifed cluster randomized design. Donner [69] extended the sample size formula for the Cochran–Mantel–Haenszel statistic with binary outcomes [70]. Donner’s sample size formula assumes the cluster size to be equal within strata but different across strata. With varying cluster sizes, commonly used ad hoc approaches have assumed the cluster sizes to be constant within each stratum, which may underestimate the required number of clusters for each group per stratum and lead to underpowered clinical trials. Xu et al. [71] extended Donner’s sample size formula [69], incorporating the variability in cluster size with the coeffcient of variation ( x ) of cluster size. Wang et al. [72] also accommodated the variability of cluster size and proposed the closedform sample size formula for cluster size-stratifed CRTs with continuous outcomes. The sample size formulas of Xu et al. [71] and Wang et al. [72] are fexible to accommodate an arbitrary randomization ratio and varying numbers of clusters across strata. The advantages of stratifed CRTs include increased precision (and power) and balanced treatment groups. The number of strata increases rapidly as the number of stratifying factors increases. Suppose that there are two categories in each stratifcation factor. The number of strata is 2k when the number of stratifcation factors is k. Relatively few stratifcation factors can be included in stratifed cluster randomized designs since the number of strata increases exponentially as the number of stratifcation factors increases. Example: In a pragmatic stratifed cluster randomization trial called “Improving chronic disease management with PIECES (NCT02587936)”, primary care clinics were randomly allocated to either an intervention or standard medical care group using a randomized permutation block within each healthcare system. The stratum is each of the four large healthcare systems participating in the study. The unit of randomization is a primary care clinic. The cluster randomization design with clinics receiving collaborative primary care–subspecialty care vs standard care limits the risk of cross-contamination between intervention and control

16

Design and Analysis of Pragmatic Trials

groups. Based on the assignment of the clinic where a patient goes, each patient will be assigned to either the intervention or standard medical care group. Evaluation will be performed to determine treatment effects on disease-specifc hospitalizations, ER visits, cardiovascular events, and deaths.

The sample size was estimated based on comparing one-year unplanned allcause hospitalization rates between the control and intervention groups. The sample size formula developed by Donner [69] for stratifed randomized trials was employed. Preliminary data showed that the rate of unplanned allcause hospitalization during the one-year follow-up period was 14% across all four large healthcare systems in the standard medical care group. The hospitalization rate in the intervention group is expected to be 3% lower than that in the standard medical care group. A generalized linear mixed model will be used to investigate if there is a signifcant difference in the rate of unplanned all-cause hospitalization during the one-year follow-up period between the intervention group and the standard medical care group. Electronic health records show that the number of patients with coexistent CKD, hypertension, and diabetes are 4,419, 4,738, 4,175, and 1,093 in healthcare systems 1, 2, 3, and 4, respectively. The numbers of clinics are 25, 40, 50, and 9 in healthcare systems 1, 2, 3, and 4, respectively. A preliminary dataset obtained an intracluster correlation coeffcient (ICC) of 0.015. If we assumed ICC = 0.015 to detect a 3% difference in the rate of unplanned all-cause hospitalization, a total of 10,991 patients are needed to achieve 80% power at a two-sided 5% signifcance level. We would need to enroll 3,367, 3,610, 3,181, and 833 patients from healthcare systems 1, 2, 3, and 4, respectively. Here, 10,991 patients are 76.2% of 14,425 patients from the four healthcare systems participating in the study. Waiver of consent has been obtained from all institutional review boards (IRBs) overseeing the research, and no informed consent will be obtained from patients. An opt-out option is available for patients in the intervention and control groups. A small number of opt-out requests are expected based on the nature of the interventions. 1.2.2.3 Covariate-Constrained Randomized Design Unbalanced intervention groups lead to a potentially severe methodological problem for CRTs. Simple or matched-pair or stratifed randomization often cannot adequately balance multiple cluster-level covariates between intervention groups when the number of clusters is small. Therefore, covariate-constrained randomization was proposed as an allocation technique to achieve balance when clusters can be recruited before randomization. Once cluster-level baseline data are obtained, the randomization process can begin. Moulton [73] described the covariate-constrained randomization (CCR) procedure that enumerates all the possible allocations of participating clusters when clusters and their covariates are known in advance. A

Pragmatic Randomized Trials

17

æwö randomization scheme is randomly chosen from all the possible ç ÷ ranègø domization schemes, where w is the total number of clusters and g is the number of clusters assigned to one study arm. For example, if we design a CRT with ten clusters, fve of which are assigned to intervention and fve æ 10 ö to control, there are ç ÷ = 252 unique allocations of ten clusters evenly è5ø assigned to two arms. The randomization scheme is randomly selected from a subset of all possible randomization schemes using a score that assesses covariate imbalance. CCR calculates a score that assesses covariate imbalance from all the possible allocation schemes, limits the randomization space to prespecifed constraints, and then selects a scheme from the constrained list to implement an acceptable allocation. Example: The reminder–recall (R/R) immunization trial compared the population-based R/R approach with the practice-based R/R approach to increase up-to-date immunization rates in 19- to 35-month-old children in 16 counties in Colorado [74]. The population-based intervention used collaborations between primary care physicians and health department leaders to develop a centralized R/R notifcation using telephone or mail for all parents whose children were not up to date on immunizations. The practice-based group invited practices to attend a webbased training on R/R using the Colorado Immunization Information System (CIIS). Sixteen counties were randomized equally to each of the two R/R approaches. Within each stratum (rural, urban), they generated 70 possible combinations of eight counties into two study groups and standardized county-level variables by computing simple z scores for each measure. Let xi be the cluster-level variable with x as the mean of xi and s as the standard deviation across clusters. To allow each measure to contribute approximately equally to the balancing process, cluster-level variables were standardized using a simple z score, zi = (xi - x)/ s . For each possible randomization, clusters were allocated to a treatment group. Next, the absolute value of the difference between the allocated treatment groups on the original county-level raw variables was computed for each randomization. The balance criterion measures the overall imbalance. The best 10% randomizations for rural and urban counties were designated as the optimal set.

1.2.3 Multiple-Period Cluster Randomized Designs CRTs have lower effciency than IRTs due to correlation among subjects within the same cluster. Effciency can be improved by increasing the sample size. However, increasing the sample size is not always possible since the number of clusters and cluster sizes are often limited. One strategy to increase effciency is to implement a multiple-period design within a CRT. That is, multiple-period CRTs extend traditional CRTs over time, with each cluster contributing observations over prespecifed time points, possibly

18

Design and Analysis of Pragmatic Trials

switching between intervention groups. The most basic type of multipleperiod CRTs is a longitudinal cluster randomized trial (LCRT), where clusters of subjects are randomly assigned to either a control or an intervention group and measured repeatedly at prespecifed time points in the assigned group for the duration of the study [75, 76]. Crossover cluster randomized trials (CCRTs) and stepped-wedge cluster randomized trials (SW-CRTs) are two other types of multiple-period CRTs. CCRTs [77–80] aim to increase statistical power compared to LCRTs by alternating clusters between control and intervention groups. In CCRTs, clusters of subjects are randomly assigned to different sequences of treatments over the study period. Correlation structures are complicated in LCRTs and CCRTs due to the longitudinal and clustered nature of trials. Wang et al. [76] used the GEE method to derive a closed-form sample size formula for LCRTs using a general correlation structure for the outcome data, including (1) correlation among repeated measurements from the same subjects (within-subject or longitudinal correlation), (2) correlation between concurrent measurements from different subjects within the same cluster (the conventional intracluster correlation coeffcient, ICC), and (3) correlation between nonconcurrent measurements from different subjects within the same cluster (cross-subject cross-period correlation). Ignoring the above correlations leads to biased estimation and improper sample size estimation. Generalized linear mixed model (GLMM) and generalized estimating equation (GEE) approaches have been commonly used for sample size estimation and statistical analysis. SW-CRTs [81–85] randomize clusters of subjects to one of the intervention sequences. Each sequence requires that clusters start in the control condition and then switch to the intervention condition one at a time at prespecifed time points. Hemming et al. [86] provided a tutorial on sample size calculation for multiple-period cluster randomized trials using the Shiny CRT Calculator (https://clusterrcts .shinyapps.io/rshinyapp/). There are three major types of multiple-period CRTs [30, 86]. One is a closed-cohort design, where all subjects are recruited at the beginning of the trial, repeatedly assessed throughout the study; hence, longitudinal measurements are collected from the same subjects. New subjects cannot join the trial as it is ongoing. Another is a cross-sectional design, where subjects are selected for measurements at a specifc time point and will not be selected again at a later time point. That is, different subjects are assessed at each time point. The other is an open-cohort design, where subjects join and leave the clusters over time. That is, some subjects are repeatedly assessed throughout the study, while others can leave or join in the middle of the study. 1.2.3.1 Longitudinal Cluster Randomized Design In longitudinal cluster randomized trials (LCRTs), clusters of subjects, such as those formed by clinics and communities, are randomly assigned to

Pragmatic Randomized Trials

19

FIGURE 1.2 Diagrams of the LCRT and CCRT with two sequences. (Blank and shaded cells represent control and intervention, respectively.)

the intervention groups and are measured repeatedly at prespecifed time points throughout the study [87]. The upper panel in Figure 1.2 illustrates the intervention scheme of an LCRT design, which has been widely used in behavioral and health service research. Examples include longitudinal community intervention studies or family studies with repeated measures on each subject [88–90]. For instance, in community intervention trials, outcome measures such as blood pressure or quality of life is collected on each subject from each community over time. In clinical trials with repeated measurements, the responses from each subject are measured multiple times during the study period. Two approaches have been widely used to assess the intervention effect: one that compares the rates of change and the other that compares the time-averaged responses between the intervention groups. Sample size and power formulas were developed to compare the time-averaged responses and the rates of changes over time using GEE [75] and GLMM [91–93]. For example, Wang et al. [76] provided closed-form sample size and power formulas for comparing the time-averaged responses based on the GEE approach. These formulas are fexible to incorporate unbalanced randomization, different missing patterns, arbitrary correlation structures, and randomly varying cluster sizes, providing a practical yet robust sample size solution. Hooper et al. [94] showed how to calculate sample size for a longitudinal cluster randomized trial with a repeated cross-section or closed cohort design. Sample size formulas were presented for open cohorts and the impact of the degree of “openness” on sample size [95]. For proper statistical inferences, generalized linear mixed model (GLMM) and generalized estimating equation (GEE) approaches have been commonly used in analyzing data from LCRTs. Example: Bosworth et al. [89] conducted an LCRT on blood pressure (BP) control at a Veterans Affairs (VA) Medical Center. Physicians were randomly allocated to either the control group or the intervention group.

20

Design and Analysis of Pragmatic Trials

In this LCRT, a physician corresponds to a cluster. Based on the assignment of his or her physician, each patient was assigned to either the control or intervention group and followed over time. In the intervention group, primary care providers (PCPs) received computer-generated decision support designed to improve guideline-concordant medical therapy at each clinic visit. In the control group, PCPs received a standard reminder at each visit. BP values were obtained at each primary care visit according to VA standards. An LRCT was selected over a longitudinal individual randomized trial to avoid contamination and for logistical reasons such as ease of administration. The primary outcome was the proportion of patients who achieved a BP of less than 140/90 mm Hg (less than 130/85 for diabetic patients) over the 24-month intervention. A hierarchical mixed-effects logistic regression model was used to investigate if there were signifcant differences in BP control in the three intervention groups compared to the hypertension reminder control group. The model included patient-level random effects in the form of a patient-level random intercept and slope to account for the correlation between a patient’s repeated BP measurements. The model also included a provider group-level random effect to account for the correlation of patients within the same provider.

1.2.3.2 Crossover Cluster Randomized Design In crossover cluster randomized trials (CCRTs), clusters of subjects are randomly assigned to different sequences of interventions over the study period. Each cluster receives each intervention in a separate period. For example, the lower panel of Figure 1.2 illustrates a typical CCRT with two treatments. In a two-period, two-intervention CCRT design, the so-called AB/BA design, in which each cluster is randomly assigned to one of two groups of the trial. Subjects within clusters in the frst group receive intervention A in the frst period of the trial and intervention B in the second period, and subjects within clusters included in the second group receive intervention B in the frst period and intervention A in the second period. CCRTs are frequently employed in clinical research [77–80, 96–98]. With the crossover design, clusters are randomized to a sequence of intervention conditions rather than just one intervention condition. CCRTs with two periods were commonly applied. Many studies have also investigated and employed CCRTs with multiple periods [77, 99, 100]. CCRTs increase statistical effciency because each cluster receives both intervention conditions, which allows comparisons within clusters [100]. The comparisons within clusters help to eliminate chance imbalances between intervention and control conditions on cluster-level characteristics even in studies with a small number of clusters. CCRTs can have more power to detect a given treatment effect than the cluster-randomized non-crossover trials with the same number of subjects. Thus, CCRTs can be especially useful when the number of available clusters is limited. A crossover design may be cost-effective if enrolling a cluster costs much or if there are few available clusters. The

Pragmatic Randomized Trials

21

principal drawback of the CCRT is that the effects of one intervention may “carry over” and alter the response to subsequent interventions. CCRTs cannot be conducted if receiving the intervention in the frst period would affect observations in the second period. For example, a training/education intervention for smoking cessation could not be crossed over because there would not be a way to remove the effects of the intervention during the control period once a group received the training/education intervention. However, the crossover design could be implemented if the intervention does not have lingering effects. In a Nordic diet study, Danish municipal schools were randomized to intervention sequences assuming that the effect of the intervention in the frst period would not carry over into the second period [97, 98]. Sample size formulas have been proposed for the two-period, twointervention, cross-sectional CCRTs [101, 102, 78]. Sample size formulas for continuous and binary outcomes were developed for the cross-sectional, twoperiod, two-intervention CCRTs using GEE approaches, accounting for both the within-period and inter-period correlations [31]. Sample size formulas for cross-sectional CCRTs were extended for cohort CCRTs by extending the two-period nested exchangeable correlation structure to a two-period block exchangeable correlation structure. Using GEE approaches, sample size formulas were developed incorporating the missing data patterns and arbitrary correlation structures [76]. Example: The Head Positioning in Acute Stroke Trial (HeadPoST) is a two-period pragmatic CCRT conducted at 114 hospitals in nine countries [80]. The trial aimed to compare the effects of the lying-fat position with those of the sitting-up position, initiated soon after a stroke and maintained for 24 hours. Hospitals were randomly assigned to two sequences. The frst sequence received the lying-fat position and then crossed over to the sitting-up position. The second sequence received two positions in the reverse direction. The degree of disability in the modifed Rankin scale score was collected for each patient. The primary outcome was the degree of disability at 90 days, as assessed with the modifed Rankin score ranging from 0 to 6, with higher scores indicating more signifcant disability and 6 indicating death. The primary endpoint of the intervention effect, as assessed using the modifed Rankin scale, was analyzed using a hierarchical ordinal mixed-effect logistic regression model with four adjustment variables (fxed intervention effect [lying fat vs sitting up], fxed period effect, random cluster effect, and effect of the interaction between random cluster and period).

1.2.3.3 Stepped-Wedge Cluster Randomized Design Stepped-wedge cluster randomized trials (SW-CRTs) have been increasingly adopted in behavioral and health services research [82–84]. An SW-CRT is a type of crossover CRT, in which the different clusters switch interventions at

22

Design and Analysis of Pragmatic Trials

different time points. An SW-CRT extends the traditional CRT so that every cluster provides control and intervention observations, thus, somewhat acting as its own control. The design includes a baseline period where none of the clusters receive the intervention of interest. That is, all clusters receive the control at the baseline. Then, one or more randomly allocated clusters are scheduled to switch from a control to an intervention group at prespecifed time points until all clusters are exposed to the intervention of interest. Outcomes are measured at every time point. The key feature of the SW-CRT design is the unidirectional crossover of clusters from the control to the intervention groups on a staggered schedule, which induces confounding of the intervention effect over time. SW-CRTs have a unique feature that allows the stepwise introduction of interventions to a population, making it suitable to examine the effectiveness of interventions in real-world settings [103, 104]. Figure 1.3 illustrates a typical stepped-wedge cluster randomized design. The horizontal axis indicates time, and the vertical axis indicates the sequences. Each sequence includes one or more clusters of subjects. SW-CRTs have several advantages. First, every cluster ends up getting the intervention. Switching every cluster to intervention motivates patients and doctors and mitigates the ethical dilemma of withholding an effective intervention when it is believed to be superior to the control [105]. Second, it is more effcient because comparisons from two directions provide evidence of effectiveness, which will lead to higher statistical power over parallel CRTs [106]. One is vertical comparison of clusters between intervention and control at each time point. The other is horizontal comparisons of observations before and after switching to intervention within each cluster. Third, outcomes are measured longitudinally, offering insight into the temporal trend of the intervention effect. On the other hand, SW-CRTs present various challenges to researchers, which include longer trial duration, higher risk of attrition, diffculty in blinding, complex informed consent procedure, and complex data analysis. For example, the correlation structure is much more complicated than that in completely randomized cluster designs. The repeated measures may burden patients regarding quality of life, satisfaction, and costs. Moreover, the

FIGURE 1.3 An illustration of an SW-CRT with T = 4 time points and S = 3 sequences. (Each cell represents a data collection point. Shaded and blank cells represent intervention and control, respectively.)

Pragmatic Randomized Trials

23

proportion of missing data is usually higher due to repeated measurements over a longer duration, which cannot be ignored in the design. Many researchers have investigated sample size calculation methods for cross-sectional SW-CRTs [103, 107–109], where the correlation structure is more straightforward. There have been relatively fewer publications on sample size methods for closed-cohort SW-CRTs, some of which were purely simulation-based [110], some based on the linear mixed-effect model approach [94, 111], and some based on the Bayesian method [112]. Sample size formulas were proposed for closed-cohort SW-CRTs using the GEE approach [30, 32], which considered three types of correlations: within-period, interperiod, and within-individual correlations. Wang et al. [29, 30] provided the closed-form sample size formulas for SW-CRTs with continuous and binary outcomes using GEE, which can be applied to either the closed-cohort or cross-sectional SW-CRTs by modifying the correlation structures. The sample size method is fexible to address pragmatic issues such as missing data due to various mechanisms or complicated correlation structures. Mixed-effects regression and generalized estimating equation (GEE) have commonly been used in analyzing SW-CRTs. Li et al. [34] provided an overview of mixed-effects regression models that have been applied to the design and analysis of SW-CRTs. Example: The Acute Coronary Syndrome Quality Improvement in Kerala (ACS QUIK) trial was a pragmatic, cluster-randomized, stepped-wedge clinical trial in which 63 hospitals were randomized to receive the quality improvement tool kit intervention at 1 of 5 predefned, 4-month steps over 24 months after a period of usual care [113]. During fve predefned steps over a 24-month study period, hospitals were randomly selected to move in a one-way crossover from the control group to the intervention group. The primary outcome was the composite of all-cause death, reinfarction, stroke, or major bleeding at 30 days. Mixed-effects logistic regression models were used to analyze the primary outcome, 30-day major adverse cardiovascular events, with a random cluster (hospital) effect and a fxed time effect for every 4-month step. There was no statistically signifcant difference in 30-day composite event rates between the control phase of 6.4% and the intervention phase of 5.3% after adjusting for the effects of cluster and temporal trends.

References 1. D. Schwartz, and J. Lellouch. Explanatory and pragmatic attitudes in therapeutical trials. Journal of Clinical Epidemiology, 20(8):637–648, 1967. 2. D.L. Sackett. Clinician-trialist rounds: 16. mind your explanatory and pragmatic attitudes! – Part 1: What? Clinical Trials, 10(3):495–498, 2013.

24

Design and Analysis of Pragmatic Trials

3. K. Loudon, S. Traweek, F. Sullivan, P. Donnan, K. Thorpe, and M. Zwarenstein. The precis-2 tool: Designing trials that are ft for purpose. BMJ, 350:h2147, 2015. 4. R.E. Glasgow, T.M. Vogt, and S.M. Boles. Evaluating the public health impact of health promotion interventions: The re-aim framework. American Journal of Public Health, 89(9):1322–1327, 1999. 5. G. Gartlehner, R.A. Hansen, D. Nissman, K.N. Lohr, and T.S. Carey. A simple and valid tool distinguished effcacy from effectiveness studies. Journal of Clinical Epidemiology, 59(10):1040–1048, 2006. 6. K.E. Thorpe, M. Zwarenstein, A.D. Oxman, S. Treweek, C.D. Furberg, D.G. Altman, et al. A pragmatic–explanatory continuum indicator summary (precis): A tool to help trial designers. Journal of Clinical Epidemiology, 62(5):464–475, 2009. 7. N.A. Patsopoulos. A pragmatic view on pragmatic trials. Dialogues in Clinical Neuroscience, 13(2):217–224, 2011. 8. M. Rifai, D. Itchhaporia, and S. Virani. Pragmatic clinical trials—Ready for prime time? JAMA Network Open, 4(12):e2140212, 2021. 9. J.P.W. Bynum, D.A. Dorr, J. Lima, E.P. McCarthy, E. McCreedy, R. Platt, et al. Using healthcare data in embedded pragmatic clinical trials among people living with dementia and their caregivers: State of the art. Journal of the American Geriatrics Society, 68(Suppl 2):549–554, 2020. 10. F.P. Wilson, M. Shashaty, I. Testani, J. Aqeel, Y. Borovskiy, S.S. Ellenberg, et al. Automated, electronic alerts for acute kidney injury: A single-blind, parallelgroup, randomised controlled trial. Lancet, 385(9981):1988–1974, 2015. 11. B.L. Strom, R. Schinnar, F. Aberra, W. Bilker, S. Hennessy, C.E. Leonard, et al. Unintended effects of a computerized physician order entry nearly hard-stop alert to prevent a drug interaction: A randomized controlled trial. Archives of Internal Medicine, 170(17):1578–1583, 2010. 12. L.G. Hemkens, R. Saccilotto, S.L. Reyes, D. Glinz, T. Zumbrunn, O. Grolimund, et al. Personalized prescription feedback using routinely collected data to reduce antibiotic use in primary care: A randomized clinical trial. JAMA Internal Medicine, 177(2):176–183, 2017. 13. L.D. Fiore, and P.W. Lavori. Integrating randomized comparative effectiveness research with patient care. New England Journal of Medicine, 374(22):2152–2158, 2016. 14. K.A. McCord, R.A. Al-Shahi, S. Treweek, H. Gardner, D. Strech, W. Whiteley, et al. Routinely collected data for randomized trials: Promises, barriers, and implications. Trials, 19(1):29, 2018. 15. R.J. Little, R. D’Agostino, M.L. Cohen, K. Dickersin, S.S. Emerson, J.T. Farrar, et al. The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14):1355–1360, 2012. 16. D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–590, 1976. 17. D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 1987. 18. M. Taljaard, A. Donner, and N. Klar. Imputation strategies for missing continuous outcomes in cluster randomized trials. Biometrical Journal. Biometrische Zeitschrift, 50(3):329–345, 2008. 19. R.R. Andridge. Quantifying the impact of fxed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometrical Journal. Biometrische Zeitschrift, 53(1):57–74, 2011.

Pragmatic Randomized Trials

25

20. M.L. Anderson, R.M. Califf, and J. Sugarman. Ethical and regulatory issues of pragmatic cluster randomized trials in contemporary health systems. Clinical Trials, 12(3):276–285, 2015. 21. V. Palin, T.P. Van Staa, S. Steels, et al. A frst step towards best practice recommendations for the design and statistical analyses of pragmatic clinical trials: A modifed Delphi approach. British Journal of Clinical Pharmacology, in press. 22. S. Hahn, S. Puffer, D.J. Torgerson, and J. Watson. Methodological bias in cluster randomised trials. BMC Medical Research Methodology, 5(1):10, 2005. 23. A. Donner, and N. Klar. Design and Analysis of Cluster Randomization Trials in Health Research. Oxford University Press, 2000. 24. A.K. Manatunga, M.G. Hudgens, and S. Chen. Sample size estimation in cluster randomized studies with varying cluster size. Biometrical Journal, 43(1):75–86, 2001. 25. S. Kang, C. Ahn, and S. Jung. Sample size calculations in cluster randomized studies with varying cluster size: A binary case. Drug Information Journal, 37(1):109–114, 2003. 26. N.M. Ivers, I.J. Halperin, J. Barnsley, J.M. Grimshaw, B.R. Shah, K. Tu, et al. Allocation techniques for balance at baseline in cluster randomized trials: A methodological review. Trials, 13:120, 2012. 27. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for matched–pair cluster randomization design with incomplete observations of continuous outcomes. Contemporary Clinical Trials, 104:106336, 2021. 28. X. Xu, H. Zhu, A. Hoang, and C. Ahn. Sample size considerations for matched– pair cluster randomization design with incomplete observations of binary outcomes. Statistics in Medicine, 40(24):5397–5416, 2021. 29. J. Wang, J. Cao, C. Ahn, and S. Zhang. Power analysis for stepped wedge cluster randomized trials with binary outcomes. Statistical Theory and Related Fields, 5, 2021. 30. J. Wang, J. Cao, C. Ahn, and S. Zhang. Sample size determination for stepped wedge cluster randomized trials in pragmatic settings. Statistical Methods in Medical Research, 2021. 31. F. Li, A. Forbes, E. Turner, and J. Preisser. Power and sample size requirements for gee analyses of cluster randomized crossover trials. Statistics in Medicine, 38(4), 2021. 32. F. Li, E.L. Turner, and J.S. Preisser. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics, 74(4):1450–1458, 2018. 33. F. Li, L. Turner, and J. Preisser. Sample size determination for gee analyses of stepped wedge cluster randomized trials. Biometrics, 74(4):1450–1458, 2018. 34. F. Li, K. Hughes, K. Heming, M. Taljaard, E. Melnick, and P. Heagerty. Mixedeffects models for the design and analysis of stepped wedge cluster randomized trials: An overview. Statistical Methods in Medical Research, 30(2):612–639, 2021. 35. C. Ahn, M. Heo, and S. Zhang. Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research. CRC Press, 2014. 36. D.M. Murray. Design and Analysis of Group–Randomized Trials. Oxford University Press, 1998. 37. R.J. Hayes, and L.H. Moulton. Cluster Randomised Trials, 2nd Edition. Chapman and Hall/CRC, 2017.

26

Design and Analysis of Pragmatic Trials

38. S. Eldridge, and S. Kerry. A Practical Guide to Cluster Randomised Trials in Health Services Research. Wiley, 2012. 39. M.J. Campbell, and S.J. Walters. Ow to Design, Analyse and Report Cluster Randomised Trials in Medicine and Health Related Research. UK Wiley, 2014. 40. S. Teerenstra, and M. Moerbeek. Power Analysis of Trials with Multilevel Data. CRC Press, 2016. 41. C.S. Skinner, E.A. Halm, W.P. Bishop, C. Ahn, S. Gupta, D. Farrell, et al. Impact of risk assessment and tailored versus nontailored risk information on colorectal cancer testing in primary care: A randomized controlled trial. Cancer Epidemiology, Biomarkers and Prevention, 24(10):1523–1530, 2015. 42. E.L. Turner, F. Li, J.A. Gallis, M. Prague, and D.M. Murray. Review of recent methodological developments in group–randomized trials: Part 1–Design. American Journal of Public Health, 107(6):907–915, 2017. 43. E.L. Turner, M. Prague, J.A. Gallis, F. Li, and D.M. Murray. Review of recent methodological developments in group–randomized trials: Part 2–Analysis. American Journal of Public Health, 107(7):1078–1086, 2017. 44. A. Donner, N. Birkett, and C. Buck. Randomization by cluster-sample size requirements and analysis. American Journal of Epidemiology, 114(6):906–914, 1981. 45. S.M. Eldridge, D. Ashby, and S. Kerry. Sample size for cluster randomized trials: Effect of coeffcient of variation of cluster size and analysis method. International Journal of Epidemiology, 35(5):1292–1300, 2006. 46. J. Wang, S. Zhang, and C. Ahn. Sample size calculation for count outcomes in cluster randomization trials with varying cluster sizes. Communications in Statistics – Theory and Methods, 49(1):116–124, 2020. 47. L. Guittet, P. Ravaud, and B. Giraudeau. Planning a cluster randomized trial with unequal cluster sizes: Practical issues involving continuous outcomes. BMC Medical Research Methodology, 6(17):1–15, 2006. 48. C. Ahn, F. Hu, and C. Skinner. Effect of imbalance and intracluster correlation coeffcient in cluster randomized trials with binary outcomes. Computational Statistics and Data Analysis, 53(3):596–602, 2009. 49. C. Leyrat, K. Morgan, B. Leurent, and B. Kahan. Cluster randomized trials with a small number of clusters: Which analyses should be used? International Journal of Epidemiology, 47(1):321–331, 2018. 50. F.E. Satterthwaite. An approximate distribution of estimates of variance components. Biometrics, 2(6):110–114, 1946. 51. M.G. Kenward, and J.H. Roger. Small sample inference for fxed effects from restricted maximum likelihood. Biometrics, 53(3):983–997, 1997. 52. P. Li, and D.T. Redden. Comparing denominator degrees of freedom approximations for the generalized linear mixed model in analysing binary outcome in small sample cluster–randomized trials. BMC Medical Research Methodology, 15:38, 2015. 53. M.P. Fay, and B.I. Graubard. Small-sample adjustments for wald-type tests using sandwich estimators. Biometrics, 57(4):1198–1206, 2001. 54. L.A. Mancl, and T.A. DeRouen. A covariance estimator for GEE with improved small–sample properties. Biometrics, 57(1):126–134, 2001. 55. C.S. Skinner, S.M. Rawl, B.K. Moser, A.H. Buchanan, L.L. Scott, V.L. Champion, et al. Impact of the cancer risk intake system on patient-clinician discussions of tamoxifen, genetic counseling, and colonoscopy. Journal of General Internal Medicine, 20(4):360–365, 2005.

Pragmatic Randomized Trials

27

56. M.J. DeHaven, M.A. Ramos-Roman, N. Gimpel, J. Carson, J. DeLemos, S. Pickens, et al. The goodnews trial: A community-based participatory research trial with African-American church congregations for reducing cardiovascular disease risk factors – Recruitment, measurement, and randomization. Contemporary Clinical Trials, 32(5):630–640, 2011. 57. M. Taljaard, A. Donner, and N. Klar. Accounting for expected attrition in the planning of community intervention trials. Statistics in Medicine, 26(13):2615– 2628, 2007. 58. A. Kroeger, A. Lenhart, M. Ochoa, E. Villegas, N. Alexander, M. Levy, et al. Effective control of dengue vectors with curtains and water container covers treated with insecticide in mexico and Venezuela: Cluster randomised trials. British Medical Journal, 332:1247–1252, 2006. 59. K. Imai, G. King, and C. Nall. The essential role of pair matching in clusterrandomized experiments, with application to the Mexican universal health insurance evaluation. Statistical Science, 24(1):29–53, 2009. 60. D.C. Martin, P. Diehr, E.B. Perrin, and T.D. Koepsell. The effect of matching on the power of randomized community intervention trials. Statistics in Medicine, 12(3):329–338, 1993. 61. P. Diehr, D.C. Martin, T. Koepsell, and A. Cheadle. Breaking the matches in a paired t-test for community interventions when the number of pairs is small. Statistics in Medicine, 14(13):491–504, 1995. 62. A. Donner. Some aspects of the design and analysis of cluster randomization trials. Journal of the Royal Statistical Society, Series C, 47(1):95–113, 1998. 63. K. Liang. Odds ratio inference with dependent data. Biometrika, 72(3):678–682, 1985. 64. K. Liang, T.H. Beaty, and B.H. Cohen. Application of odds ratio regression models for assessing familial aggregation from case–control studies. American Journal of Epidemiology, 124(4):678–683, 1986. 65. M.J. Shipley, P.G. Smith, and M. Dramax. Calculation of power for matched pair studies when randomization is by group. International Journal of Epidemiology, 18(2):457–461, 1989. 66. C. Rutterford, A. Copas, and S. Eldridge. Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44(3):1051– 1067, 2015. 67. M.H. Gail, D.P. Byar, T.F. Pechacek, and D.K. Corle. Aspects of statistical design for the community intervention trial for smoking cessation (commit). Controlled Clinical Trials, 13(1):16–21, 1992. 68. D. Schwartz, R. Flamant, and J. Lellouch. Clinical Trials. Academic Press, 1980. 69. A. Donner. Sample size requirements for stratifed cluster randomization designs. Statistics in Medicine, 11(6):743–750, 1992. 70. R.F. Woolson, J.A. Bean, and P.B. Rojas. Sample size for case-control studies using cochran’s statistic. Biometrics, 42(4):927–932, 1986. 71. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for stratifed cluster randomization design with binary outcomes and varying cluster size. Statistics in Medicine, 38(18):3395–3404, 2019. 72. J. Wang, S. Zhang, and C. Ahn. Power analysis for stratifed cluster randomisation trials with cluster size being the stratifying factor. Statistical Theory and Related Fields, 1(1):121–127, 2017. 73. H. Moulton. Covariate-based constrained randomization of group-randomized trials. Clinical Trials, 1(3):297–305, 2004.

28

Design and Analysis of Pragmatic Trials

74. L.M. Dickinson, B. Beaty, C. Fox, W. Pace, W.P. Dickinson, C. Emsermann, and A. Kempe. Pragmatic cluster randomized trials using covariate constrained randomization: A method for practice-based research networks (PBRNs). Journal of the American Board of Family Medicine, 28(5):663–672, 2015. 75. A. Liu, W.J. Shih, and E. Gehan. Sample size and power determination for clustered repeated measurements. Statistics in Medicine, 21(12):1787–1801, 2002. 76. J. Wang, J. Cao, S. Zhang, and C. Ahn. A fexible sample size solution for longitudinal and crossover cluster randomized trials with continuous outcomes. Contemporary Clinical Trials, 109:106543, 2021. 77. J. Parienti, and O. Kuss. Cluster-crossover design: A method for limiting clusters level effect in community-intervention studies. Contemporary Clinical Trials, 28(3):316–323, 2007. 78. C. Rietbergen, and M. Moerbeek. The design of cluster randomized crossover trials. Journal of Educational and Behavioral Statistics, 36(4):472–490, 2011. 79. S.J. Arnup, J.E. McKenzie, K. Hemming, D. Pilcher, and A. Forbes. Understanding the cluster randomised crossover design: A graphical illustration of the components of variation and a sample size tutorial. Trials, 18(1):381, 2017. 80. C.S. Anderson, H. Arima, P. Lavados, L. Billot, M.L. Hackett, V. Olavarra, et al. Cluster-randomized, crossover trial of head positioning in acute stroke. New England Journal of Medicine, 376(25):2437–2447, 2017. 81. C.N. Mhurchu, D. Gorton, M. Turley, Y. Jiang, J. Michie, R. Maddison, et al. Effects of a free school breakfast programme on children’s attendance, academic achievement and short–term hunger: Results from a stepped–wedge, cluster randomised controlled trial. Journal of Epidemiology and Community Health, 67(3):257–264, 2013. 82. L.L. Bailet, K.K. Repper, S.B. Piasta, and S.P. Murphy. Emergent literacy intervention for prekindergarteners at risk for reading failure. Journal of Learning Disabilities, 42(4):336–355, 2009. 83. G. Bacchieri, A.J. Barros, J.V. Santos, H. Gonalves, and D.P. Gigante. A community intervention to prevent traffc accidents among bicycle commuters. Revista de Saude Publica, 44(5):867–875, 2010. 84. B.J. van Holland, M.R. de Boer, S. Brouwer, R. Soer, and M.F. Reneman. Sustained employability of workers in a production environment: Design of a stepped wedge trial to evaluate effectiveness and cost-beneft of the pose program. BMC Public Health, 12:1003, 2012. 85. K. Hemming, M. Taljaard, J.E. McKenzie, R. Hooper, A. Copas, J.A. Thompson, et al. Reporting of stepped wedge cluster randomised trials: Extension of the consort 2010 statement with explanation and elaboration. BMJ, 363:k1614, 2018. 86. K. Hemming, J. Kasza, A. Hooper, R. Forbes, and A. Taljaard. A tutorial on sample size calculation for multiple-period cluster randomized parallel, cross-over and stepped-wedge trials using the shiny crt calculator. International Journal of Epidemiology, 49(3):979–995, 2020. 87. H.A. Feldman, and S.M. McKinlay. Cohort versus cross-sectional design in large feld trials: Precision, sample size, and a unifying model. Statistics in Medicine, 13(1):61–78, 1994. 88. S. Sabo, C. Denman, M.L. Bell, E. Cornejo Vucovich, M. Ingram, C. Valenica, et al. Meta salud diabetes study protocol: A cluster-randomised trial to reduce cardiovascular risk among a diabetic population of Mexico. BMJ Open, 8(3):e020762, 2018.

Pragmatic Randomized Trials

29

89. H.B. Bosworth, M.K. Olsen, T. Dudley, M. Orr, M.K. Goldstein, S.K. Datta, et al. Patient education and provider decision support to control blood pressure in primary care: A cluster randomized trial. American Heart Journal, 157(3):450– 456, 2009. 90. G. Grandes, A. Sanchez, I. Montoya, R. Ortega Sanchez-Pinilla, J. Torcal; PEPAF Group. Two-year longitudinal analysis of a cluster randomized trial of physical activity promotion by general practitioners. Plos One, 6(3):e18363, 2011. 91. M. Heo, Y. Kim, X. Xue, and M. Kim. Sample size requirement to detect an intervention effect at the end of follow-up in a longitudinal cluster randomized trial. Statistics in Medicine, 29(3):382–390, 2010. 92. M. Heo, and A.C. Leon. Sample size requirements to detect an intervention by time interaction in longitudinal cluster randomized clinical trials. Statistics in Medicine, 28(6):1017–1027, 2009. 93. M. Heo, X. Xue, and M. Kim. Sample size requirements to detect a two– or three–way interaction in longitudinal cluster randomized clinical trials with second–level randomization. Clinical Trials, 11(4):503–507, 2014. 94. R. Hooper, S. Teerenstra, E. de Hoop, and S. Eldridge. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35(26):4718–4728, 2016. 95. J. Kasza, R. Hooper, A. Copas, and A.B. Forbes. Sample size and power calculations for open cohort longitudinal cluster randomized trials. Statistics in Medicine, 39(13):1871–1883, 2020. 96. S. Patterson, L. McDaid, K. Saunders, C. Battison, A. Glasier, A. Radley, et al. Improving effective contraception uptake through provision of bridging contraception within community pharmacies: Findings from the bridge-it study process evaluation. BMJ Open, 12(2):e057348, 2022. 97. C.T. Damsgaard, S.M. Dalskov, R.A. Petersen, L.B. Sørensen, C. Mølgaard, A. Biltoft-Jensen, et al. Design of the opus school meal study: A randomised controlled trial assessing the impact of serving school meals based on the new Nordic diet. Scandinavian Journal of Public Health, 40(8):693–703, 2012. 98. C.T. Damsgaard, S.M. Dalskov, R.P. Laursen, C. Ritz, M.F. Hjorth, L. Lauritzen, et al. Provision of healthy school meals does not affect the metabolic syndrome score in 811-year-old children, but reduces cardiometabolic risk markers despite increasing waist circumference. British Journal of Nutrition, 112(11):1826–1836, 2014. 99. G.R. Norman, and S. Daya. The alternating-sequence design (or multiple– period crossover) trial for evaluating treatment effcacy in infertility. Fertility and Sterility, 74(2):319–324, 2000. 100. K.L. Grantham, J. Kasza, S. Heritier, K. Hemming, E. Litton, and A.B. Forbes. How many times should a cluster randomized crossover trial cross over? Statistics in Medicine, 38(25):5021–5033, 2019. 101. B. Giraudeau, P. Ravaud, and A. Donner. Sample size calculation for cluster randomized cross-over trials. Statistics in Medicine, 27(27):5578–5585, 2008. 102. A.B. Forbes, M. Akram, D. Pilcher, J. Cooper, and R. Bellomo. Cluster randomised crossover trials with binary data and unbalanced cluster sizes: Application to studies of near–universal interventions in intensive care. Clinical Trials, 12(1):34–44, 2015. 103. M.A. Hussey, and J.P. Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28(2):182–191, 2007.

30

Design and Analysis of Pragmatic Trials

104. C.A. Brown, and R.J. Lilford. The stepped wedge trial design: A systematic review. BMC Medical Research Methodology, 6(1):54, 2006. 105. S.J. Edwards. Ethics of clinical science in a public health emergency: Drug discovery at the bedside. The American Journal of Bioethics, 13(9):3–14, 2013. 106. J.R. Hargreaves, A. Prost, K.L. Fielding, and A.J. Copas. How important is randomisation in a stepped wedge trial? Trials, 16(1):359, 2015. 107. K. Hemming, T.P. Haines, P.J. Chilton, A.J. Girling, and R.J. Lilford. The stepped wedge cluster randomised trial: Rationale, design, analysis, and reporting. BMJ, 350:h391, 2015. 108. W. Woertman, E. de Hoop, M. Moerbeek, S.U. Zuidema, D.L. Gerritsen, and S. Teerenstra. Stepped wedge designs could reduce the required sample size in cluster randomized trials. Journal of Clinical Epidemiology, 66(7):752–758, 2013. 109. L.H. Moulton, J.E. Golub, B. Durovni, S.C. Cavalcante, A.G. Pacheco, V. Saraceni, et al. Statistical design of THRio: A phased implementation clinic-randomized study of a tuberculosis preventive therapy intervention. Clinical Trials, 4(2):190– 199, 2007. 110. G. Baio, A. Copas, G. Ambler, J. Hargreaves, E. Beard, and R.Z. Omar. Sample size calculation for a stepped wedge trial. Trials, 16(1):354, 2015. 111. A.J. Girling, and K. Hemming. Statistical effciency and optimal design for stepped cluster studies under linear mixed effects models. Statistics in Medicine, 35(13):2149–2166, 2016. 112. K.M. Cunanan, B.P. Carlin, and K.A. Peterson. A practical bayesian stepped wedge design for community-based cluster-randomized clinical trials: The British columbia telehealth trial. Clinical Trials, 13(6):641–650, 2016. 113. M.D. Huffman, P. Mohanan, R. Devarajan, A.S. Baldridge, D. Kondal, L. Zhao, et al. Effect of a quality improvement intervention on clinical outcomes in patients in india with acute myocardial infarction: The acs quik randomized clinical trial. JAMA, 319(6):567–578, 2018.

2 Cluster Randomized Trials

2.1 Introduction A cluster randomized trial (CRT) randomizes groups (clusters) of individuals to each intervention of a trial rather than individuals [1–4]. The clusters used in CRTs vary widely and range in size from families to entire communities. CRTs have become increasingly used in health services research because they are ideally suited to address issues related to policy, practice, and healthcare organizations with groups defned by social organizations or geographical areas such as schools, medical clinics, hospitals, churches, or communities. Advantages of CRTs include administrative convenience, lower implementation cost, ease of obtaining cooperation from investigators, ethical consideration, enhancement of subject compliance, avoidance of contamination between intervention and control groups, and intervention naturally applied at the cluster level [1]. A critical feature of cluster randomization trials is the dependency among subjects within the same cluster. Outcomes for individuals within a cluster are likely to be more similar to each other than those for individuals from other clusters due to the following possible reasons [5]. First, individuals tend to choose the cluster to which they belong. For example, individuals with higher income and education may select a particular neighborhood for living. Second, cluster-level covariates may affect all individuals within the cluster in the same manner. For example, some physicians may prefer a particular intervention, which will ensure that all participants under that physician will receive the same intervention. Third, participants with a cluster can infuence each other and therefore respond similarly. The degree of similarity or correlation within the same cluster is typically measured by an intracluster correlation coeffcient (ICC). The ICC, denoted as r , quantifes the portion of the total variance of the outcome variable that is accounted for by the between-cluster variance. The ICC profoundly impacts data analysis and study design including sample size estimation. If one ignores the clustering effect and analyzes clustered data using standard statistical methods developed for the analysis of independent observations, one may underestimate the true p-value and infate the type I error rate. DOI: 10.1201/9781003126010-2

31

32

Design and Analysis of Pragmatic Trials

Therefore, clustered data should be analyzed using statistical methods that take account of the dependence of within-cluster observations. If one fails to consider the clustered nature of the study design during the planning stage, one will obtain a smaller sample size estimate and statistical power than planned. Compared to an individually randomized trial (IRT), the intervention effect estimated in a CRT of equal cluster size (m) has a larger variance, by a variance infation factor (VIF) of {1 + (m - 1)r } due to the correlation among subjects within the same cluster [6]. This variance infation factor is also called the design effect (DE). The number of subjects under CRTs can be obtained by multiplying the design effect by the number of subjects calculated under IRTs to maintain the required level of power. The number of subjects in each intervention group ( nCRT ) in a CRT is nCRT = nI × {1 + (m - 1)r } , where nI is the number of subjects required if we were to undertake an IRT. If cluster size is 1 (m = 1) or the ICC ( r ) is zero (no variation between clusters), DE is 1, and the number of subjects for a CRT is the same as that for an IRT. That is, nCRT = nI if m = 1 or r = 0. If the ICC is 1 (no variation within a cluster), the number of subjects for a CRT is obtained by multiplying the number of subjects for an IRT by a cluster size m. That is, nCRT = nI × m. As r increases, the required number of subjects increases, and the difference can be signifcant. For m = 50, the infation in the required number of subjects will be by a factor of 1.049, 1.49, and 5.9 for r = 0.001, 0.01, and 0.1, respectively. Randomizing clusters requires considerably larger numbers of subjects than randomizing individual subjects. Cornfeld [7] pointed out the need for special consideration of the statistical features of cluster randomized trials and, in particular, the need to account for between-cluster variation. Donner and Klar [1] pointed out that analysis must also consider variation in cluster size, which is often substantial. Investigators frequently use the design effect (DE) to conduct sample size calculations and hypothesis tests that are appropriately adjusted for the impact of clustering on an outcome’s variance. Calculating a sample size that produces adequate power under the assumption of an IRT, then multiplying that sample size by the DE, ensures that a cluster randomized design is of equal statistical power [1]. Similarly, calculating a standard two-sample t-statistics to test hypotheses while treating observations as independent, then dividing these statistics by the square root of the DE, respectively, produces appropriate cluster-adjusted tests [8]. In CRTs, inferences are often applied at the individual level, while randomization is done at the cluster level.

2.2 Continuous Outcomes Continuous outcomes frequently occur in cluster randomized trials. For example, Quinn et al. [9] evaluated the effectiveness of adding mobile

33

Cluster Randomized Trials

application coaching and patient/provider web portals to community primary care in reducing glycated hemoglobin levels in patients with type 2 diabetes with a CRT. In the mobile diabetes intervention study, 26 primary care practices were randomly assigned to one of three-stepped treatment groups or a control group (usual care). Frost and Lane [10] investigated whether a standardized and manualized care intervention delivered in primary care achieves superior symptomatic outcomes for lower urinary tract symptoms versus usual care using a CRT. Practices were randomized 1:1 to either an intervention or a usual care pathway. Victor et al. [11] conducted a CRT to investigate if the participants at barbershops with the pharmacist-led intervention had a more signifcant reduction in systolic blood pressure after six months than those in the control group. Barbershops were randomly assigned to a pharmacist-led intervention or to an active control approach. Many CRTs aim to compare the means between the intervention and control groups. That is, the goal is to test H 0 : m1 = m2 against H1 : m1 ¹ m2 , where m1 and m2 are the means of the continuous outcome in the intervention and control groups. Cluster-level analysis consists of calculating a summary measure of the outcome for each cluster, such as a cluster mean or median, and then analyzing the summaries using standard statistical tests for independent data. Individual-level analysis allows all the individual-level data to be utilized, accounting for the intracluster correlation. Adjustments can now be made to simple statistical tests to account for the clustering effect. For example, test statistics based on the t-test should be divided by the square root of the design effect [12]. Individual-level analysis can be done using adjusted two-sample t-tests, mixed-effects linear regression models, and generalized estimating equation (GEE) approaches. Sample size formulas have been developed for CRTs with continuous outcomes with varying cluster sizes since cluster sizes are not constant in many CRTs [6, 13, 14]. For simplicity, we assume an equal allocation of clusters between two intervention groups, n1 = n2 = n , where nk is the number of clusters in the kth intervention ( k = 1, 2 ). Let k = 1 and 2 denote the intervention and control groups, respectively. Let mik be the cluster size of the ith cluster of the kth intervention and be independently and identically distributed with mean q and variance t2 . Let Yijk be the continuous random variable denoting the response of the jth ( j = 1, ˜ , mik ) subject in the ith ( i = 1, ˜ , n ) cluster of the kth intervention, where Yijk is assumed to be normally distributed with E(Yijk ) = mk and variance S . Let Yik denote the mean response computed over all mik subjects in the ith cluster of the kth intervention. Then, the mean of Yijk over all subjects in intervention group k is

å mY Y = å m n

k

i=1 n

i=1

ik ik ik

.

34

Design and Analysis of Pragmatic Trials

å å 2

n

The total number of subjects is M = mik and the total number of k=1 i=1 clusters is n1 + n2 = 2n. The degree of dependence within clusters is measured by the ICC ( r ). The ICC is the proportion of the total variance in response that can be accounted for by the between-cluster variance. Thus, r can be written as

r = s B2 / s 2 = s B2 /(s B2 + s W2 ), where s B2 is the between-cluster variance, s W2 is the within-cluster variance, and s 2 is the total variance. The ICC can be estimated by analysis of variance (ANOVA) estimate [15, 16]. Let MSB and MSW denote the mean square error between and within clusters. Then, the ANOVA estimator of r is given by

rˆ =

MSB - MSW 2 ), = SB2 /(SB2 + SW MSB + (m* - 1)MSW

where SB2 = (MSB - MSW)/ m* and SW2 = MSW are sample estimates of s B2 and s W2 , respectively. Here, 2

MSB =

n

(Yik - Yk )2 , 2(n - 1)

åå

mik

k=1 i=1 2

MSW =

mik

n

ååå k=1 i=1 j=1

mAk

å = å

n i=1 n i=1

(Yijk - Yik )2 , M - 2n mik2 mik

,

and

*

m =

M-

å

2 k=1

mAk

2(n - 1)

.

Donner and Klar [1] provided statistical methods to analyze continuous outcomes in cluster randomization trials. The data from CRTs can be analyzed using an adjusted two-sample t-test, which is based on a simple extension of the standard two-sample t-test. A more fexible but computationally intensive GEE and a mixed-effects model can be used to analyze outcomes incorporating the hierarchical structure of the data.

35

Cluster Randomized Trials

2.2.1 Standard Two-Sample t-Test

å å n

mik

å

n

An unbiased estimator of mk can be expressed as mˆ k = Yijk / mik . i=1 j=1 i=1 The difference of mˆ 1 - mˆ 2 estimates the population mean difference m1 - m2 . A standard error of mˆ 1 - mˆ 2 is computed by considering the randomization by cluster to set up the test statistic for testing H 0 : m1 = m2 against H1 : m1 ¹ m2 . If one applies a standard two-sample t-test that ignores the effect of clustering, the magnitude of bias associated with t-statistic increases with the value of r and the average cluster size ( m ) [1]. If one ignores the effect of clustering for clusters of fxed size m, the bias is approximately given by the square root of the design effect (DE), i.e., by 1 + (m - 1)r . That is, calculating a standard twosample t-statistics to test hypotheses, but then dividing these statistics by the square root of the DE produces appropriate cluster-adjusted t-test statistics [1, 8]. 2.2.2 Adjusted Two-Sample t-Test We assume that the correlation structure is exchangeable within a cluster since all the subjects receive the same intervention within each cluster. Yijk is assumed to be normally distributed with E(Yijk ) = mk and variance S . Thus, the diagonal elements of S are s 2 , and the off-diagonal elements are rs 2. Conditional on the cluster size mik , mˆ k follows a normal distribution with mean mk and variance Vk , where

å V = k

n i=1

mik {1 + (mik - 1)r }s 2 æ ç è

å

mik ö÷ i=1 ø n

2

.

The variance of mˆ k can be estimated by Vˆk =

Cks 2

å

n i=1

mik

,

where the variance infation factor, Ck , is

å C = k

n i=1

mik {1 + (mik - 1)rˆ }

å

n i=1

mik

.

The standard error of mˆ 1 - mˆ 2 is SE( mˆ 1 - mˆ 2 ) = Vˆ1 + Vˆ2 . The test statistic for testing H 0 : m1 = m2 against H1 : m1 ¹ m2 is T=

mˆ 1 - mˆ 2 . Vˆ1 + Vˆ2

36

Design and Analysis of Pragmatic Trials

T-statistic follows a t-distribution with 2n - 2 degrees of freedom. If r = 0 , the above T-statistic reduces to the standard t-test statistic that compares the means of two independent samples. The null hypothesis H 0 : m1 = m2 is rejected if the absolute value of the test statistic T is larger than t1-a/2 (2n - 2) , the 100(1 - a / 2) percentile of the t-distribution with 2n - 2 degrees of freedom. For a large n, based on the asymptotic theory, we reject H 0 : m1 = m2 if the absolute value of the test statistic T is larger than z1-a/2 , the 100(1 - a / 2) percentile of the standard normal distribution. Manatunga et al. [13] derived the number of clusters per group (n) for continuous outcomes in varying cluster sizes with a two-sided signifcance level of a and power of 1 - g for testing H 0 : m1 = m2 against the alternative hypothesis H1 : m1 ¹ m2 . Since mik ’s are independent and identically distributed random variables, by the law of large numbers, as n ® ¥ , nVk ®

E[m{1 + (m - 1)rˆ }]s 2 , E(m)2

where m is the random variable associated with the cluster size and E(×) is the expectation for the distribution of the cluster size. When the number of clusters (n) is suffciently large, Vˆ1 and Vˆ2 are consistent estimates for V1 and V2 . The number of clusters needed to achieve a power of 1 - g is n=

2s 2 ( z1- a/2 + z1- g )2 E[m{1 + (m - 1)rˆ }] . ( m1 - m2 )2 E(m)2

Let x = t / q , where E(m) = q , Var(m) = t2 , and x is the coeffcient of variation of the cluster size. Then, the required number of clusters in each intervention group can be expressed as n=

2s 2 ( z1-a/2 + z1- g )2 1 {(1 - rˆ ) + rˆ + rˆx2 }, 2 ( m1 - m2 ) q

which estimates the total number of clusters by accounting for variability in cluster size. If there is no variability in cluster size with mik = m , then the number of clusters in each group is n=

2s 2 ( z1-a/2 + z1-g )2 {1 + (m - 1)rˆ } . m( m1 - m2 )2

(2.1)

Since the number of clusters is usually small in community intervention trials, the use of critical values z1-a/2 and z1-g in Equation (2.1) instead of critical values t1-a/2 and t1-g corresponding to the t-distribution underestimates the required sample size. To adjust for underestimation, one cluster per group may be added when the sample size is determined with a 5% level of signifcance and two clusters per group with a 1% level of signifcance [17].

37

Cluster Randomized Trials

This sample size formula for the number of clusters per group can be easily extended to accommodate unbalanced randomization. Suppose the intervention:control randomization ratio is r : 1 . Then, the number of clusters required for the control arm is 1 ö s 2 (z1-a/2 + z1-g )2 1 æ n2 = ç 1 + ÷ {(1 - rˆ ) + rˆ + rˆx2 }. 2 m m q ) r ( è ø 1 2

(2.2)

The number of clusters required for the intervention arm is n1 = r × n2 . It is noteworthy that when x = 0, i.e., there is no variability in cluster size, so all mik = q , the required sample size becomes 1 ö s 2 (z1-a/2 + z1-g )2 1 æ {(1 - rˆ ) + rˆ }. n2* = ç 1 + ÷ 2 m m q r ( ) è ø 1 2 Comparing n2 and n2* , the relative change in the number of clusters due to random variability in cluster size can be quantifed by R=

rq n2 x 2. -1= n2* 1 - r + rq

Note that: • R does not depend on design parameters such as signifcance level a , power 1 - g , randomization ratio, m1 , and m2 . • R is proportional to x 2 , the squared coeffcient of variation in cluster sizes. • Under a positive ICC, r > 0 , which is usually the case in practice, we have R > 0 . The variability in cluster size always leads to a larger sample size requirement. Therefore, ignoring the variability in cluster size will lead to underpowered trials. • R ® 0 as r ® 0 . From an equivalent expression, R=

qx 2 , 1- r +q r

R approaches 0 as r approaches 0. Note that R monotonically increases as r increases for r > 0 . Hence, given the same coeffcient of variation ( x ), larger r is associated with a greater increase in sample size requirement. • From another equivalent expression, R=

rx2 , 1- r +r q

38

Design and Analysis of Pragmatic Trials

we observe that R monotonically increases as the mean cluster size ( q ) increases. Moreover, given the same coeffcient of variation ( x ) and intracluster correlation ( r ), a larger average cluster size is associated with a greater increase in sample size requirement. When the cluster sizes are not constant, a commonly used approach is to replace the cluster size ( mik ) with an advanced estimate of the average cluster size ( m ), which was referred to as the average cluster size method [13, 18]. However, this approximate average cluster size method is likely to underestimate the actual required sample size [1, 13, 18]. Example: A cluster randomization trial was conducted to evaluate the effectiveness of adding mobile application coaching and patient/provider web portals to community primary care in reducing glycated hemoglobin levels in patients with type 2 diabetes [9]. Patients from the same primary care provider (PCP) received the same intervention. Linear mixed-effects models were used to investigate if there was a signifcant difference in changes in glycated hemoglobin (%) from baseline to 12 months between usual care (UC) and maximal treatment groups. Maximal treatment was a mobile- and web-based self-management patient coaching system and provider decision support. In the mixedeffects models, random effects accounted for within-practice clustering and within-patient correlation, and fxed effects were treatment group indicator, time indicator, and interaction between treatment group and time. There was a signifcant difference in changes in glycated hemoglobin between the maximal treatment group (1.9%) and the usual care group (0.7%), a difference of 1.2% over 12 months (p < 0.001).

Suppose that the investigators are interested in conducting a completely randomized cluster trial based on the estimated design parameters to achieve 90% power at a two-sided signifcance level of 5%. Primary care providers will be equally randomized either to receive the coach PCP portal with a decision support (CPDS) or usual care. Patients from the same PCP will receive the same treatment. We assume that the mean decline in glycated hemoglobin will be 1.9% in the maximal treatment group and 0.7% in the usual care group, a difference of 1.2% over 12 months. We assume the common standard deviation of change in glycated hemoglobin over 12 months is 2.8%. We expect an intracluster correlation of 0.10, similar to a previously reported study [19]. If every PCP enrolls exactly 13 patients, the required number of clusters and patients is 20 PCPs and 260 patients in each intervention group. With a cluster size following a uniform distribution U[4, 22], the required number of clusters and patients is 21 PCPs and 273 patients in each intervention group, respectively. Finally, if more patients are expected to be enrolled at each clinic where the cluster size follows a uniform distribution of U[15, 35], the required number of clusters and patients is 16 PCPs and 400 patients per intervention group, respectively. This example illustrates that the number of patients increases as the variability in cluster size increases.

39

Cluster Randomized Trials

2.2.3 Generalized Estimating Equation Method Generalized estimating equation (GEE) is a common modeling approach used in cluster randomized trials to account for within-cluster correlation [20]. A major appeal of the GEE approach is that the asymptotic normality of the sandwich estimator is robust, even if the correlation structure is misspecifed. For simplicity, assume that the number of subjects in each cluster is the same. That is, mi = m , where mi is the number of subjects in cluster i. Let n1 and n2 be the number of clusters in the intervention and control groups, respectively. Let Yij denote the continuous response measurement obtained from a subject j ( j = 1, ˜ , m ) in a cluster i ( i = 1, ˜ , N ), where m is the number of subjects in cluster i, and N = n1 + n2 is the total number of clusters enrolled in the study. The clusters are randomly assigned to a control or an intervention group, indicated by ri = 0 or 1, respectively. We use r=

å

N

ri / N to denote the proportion of clusters assigned to the interven-

i=1

tion group. The CRT is a balanced design with n1 = n2 for r = 0.5. The aim is to compare the means between the two intervention groups. That is, the goal is to test H 0 : m1 = m2 against H1 : m1 ¹ m2 , where m1 and m2 are the means of the continuous outcome in the intervention and control groups. To make an inference based on the mean difference between the two groups, we assume the following statistical model, Yij = b1 + b 2 × ri + eij , where parameter b1 models the intercept effect, parameter b 2 models the difference in continuous responses between two groups, and eij denotes a random error. Testing H 0 : m1 - m2 = 0 is equivalent to testing hypothesis H 0 : b 2 = 0 . It is usually assumed that E(eij ) = 0 and Var(eij ) = s 2 . We assume that the correlation structure is exchangeable within a cluster since all the subjects receive the same intervention within each cluster. To simplify derivation, we rewrite the model as Yij = b1 + b2 (ri - r ) + eij , where b1 = b1 + b 2 r and b2 = b 2 . Defne the vector of covariates Zij = (1, ri - r )¢ . The GEE estimator bˆ = (bˆ1 , bˆ2 ) solves equation SN ( b) = 0 , where N

SN ( b) = N -1/2

m

åå(Y - b’Z )Z ’. ij

ij

ij

i=1 j=1

That is, æ bˆ = ç ç è

N

ö Zij Zij’÷ ÷ j=1 ø m

åå i=1

-1

N

m

ååZ Y . ij ij

i=1 j=1

40

Design and Analysis of Pragmatic Trials

We use the independent working correlation structure to derive the GEE estimator. Following Liang and Zeger [20], N (bˆ - b) is approximately normal with a mean of 0 and variance S N = AN-1VN AN-1 , with N

AN = N -1

m

N

åå

Zij Zij’ = N -1

i=1 j=1

m

ri - r ö ÷, (ri - r )2 ø

æ1

åå çè r - r i

i=1 j=1

and N

VN = N

-1

æ ç ç è

öæ Zijeˆ ij ÷ ç ÷ç j=1 øè m

åå i=1



ö Zijeˆ ij ÷ = N -1 ÷ j=1 ø m

å

N

m

m

åååeˆ eˆ

ij ij¢

1 j=1 j¢=1 i=1

æ1 ç è ri - r

ri - r

ö ÷, (ri - r ) ø 2

and eˆ ij = Yij - bˆ’Zij . Hence, we reject the hypothesis H 0 : b2 = 0 , which is equivalent to reject the hypothesis H 0 : b 2 = 0 , if the absolute value of Z = Nbˆ2 / sˆ 2 is greater than z1-a/2 . Here sˆ 22 is the (2, 2) th element in S N , and z1-a/2 is the 100(1 - a / 2) th percentile of a standard normal distribution. At the design stage, researchers need to determine the number of clusters required to test the hypotheses H 0 : m1 = m2 versus H1 : m1 ¹ m2 with a twosided signifcance level of a and power 1 - g . Let r =Corr(eij , eij¢ ) for j ¹ j¢ and rjj = 1 . It can be shown that, as N ® ¥ , AN ® A and VN ® V , with æ1 A = mç è0

0 ö ÷, s r2 ø

and æ1 V = ms 2 [1 + (m - 1)r ] ç è0

0 ö ÷, s r2 ø

where s r2 = r (1 - r ) . It is easy to show that the (2, 2) th element of S = A -1VA -1 is

s 22 =

s 2 [1 + (m - 1)r ] . ms r2

Then the total number of clusters needed for testing H 0 : b 2 = 0 with a twosided type I error of a and power 1 - g is N=

s 22 (z1-a /2 + z1-g )2 s 2 [1 + (m - 1)r ](z1-a /2 + z1-g )2 = . ms r2 ( m1 - m2 )2 b 22

The number of clusters required is Nr and N(1 - r ) for intervention and control groups, respectively. For example, r = 0.5 indicates a balanced design while r = 0.67 implies a 2 : 1 randomization ratio between the intervention

41

Cluster Randomized Trials

and control group with n1 = 2n2 . The total number of clusters needed for testing the hypothesis H 0 : m1 = m2 in a balanced design is N=

4s 2 ( z1-a/2 + z1-g )2 [1 + (m - 1)r ] . m( m1 - m2 )2

The number of clusters needed per group is N / 2 , which is equivalent to the number of clusters per group in Equation (2.1). The total number of subjects needed for testing the hypothesis H 0 : m1 = m2 in a balanced design can be written NSubj = m × N =

4s 2 ( z1-a/2 + z1-g )2 [1 + (m - 1)r ] . ( m1 - m2 )2

As a special case with a cluster size of one ( m = 1 ), the required total number of subjects is N = NSubj = n1 + n2 =

4s 2 ( z1- a/2 + z1-g )2 , ( m1 - m2 )2

which is the sample size estimate commonly used for independent observations. 2.2.4 Mixed-Effects Linear Regression Models The continuous outcome Yij is denoted in the same way as in the GEE method. The hypothesis testing H 0 : m1 = m2 is based on a mixed-effects model [4], Yij = b1 + b 2 × ri + ui + eij ,

(2.3)

where Yij is the continuous response of the jth subject in the ith cluster, the parameter b1 models the intercept effect, the parameter b 2 is the intervention effect of interest, and the indicator variable, ri , is 1 if cluster i is assigned to an intervention group and 0 otherwise. A random effect term for the ith cluster, ui , is distributed as N(0, s u2 ) , and a random effect for the jth subject in the ith cluster, eij , is distributed as N(0, s e2 ) . That is, the variance of the cluster random effect is s u2 and the variance of the subject random effects is s e2 . The intracluster correlation is r = Corr(Yij , Yij¢ ) = s u2 / s 2 , where s 2 = s u2 + s e2 , and s 2 is the variance of Yij . The subjects are randomly assigned to a control or an intervention group, indicated by ri = 0 or 1, respectively. We use r =

å

N

ri / N to denote the

i=1

proportion of subjects in the intervention group. The aim is to compare the means between the two groups. That is, the goal is to test H 0 : m1 = m2 against H1 : m1 ¹ m2 , where m1 and m2 are the means of the continuous outcome in the intervention and control groups, respectively. The test for the difference

42

Design and Analysis of Pragmatic Trials

of the two group means is the same as the test of b 2 in the mixed-effects model analysis. That is, testing H 0 : m1 - m2 = 0 is equivalent to testing the hypothesis H 0 : b 2 = 0 . The unbiased estimators for the means of the outcome Yij are

mˆ 1 =

å å N

m

i=1

j=1

Yij ri /(mN) and mˆ 2 =

å å N

m

i=1

j=1

Yij (1 - ri )/(mN) for the inter-

vention and control groups, respectively. The variances of mˆ 1 and mˆ 2 are Var( mˆ 1 ) =

{1 + (m - 1)r }s 2 , mNr

Var( mˆ 2 ) =

{1 + (m - 1)r }s 2 . mN(1 - r )

and

Note that bˆ2 = mˆ 1 - mˆ 2 is an unbiased estimate of b 2 . The variance of bˆ2 is Var( bˆ2 ) = Var( mˆ 1 ) + Var( mˆ 2 ) =

{1 + (m - 1)r }s 2 , mNs r2

where s r2 = r (1 - r ) . The test statistic Z for testing H 0 : m1 = m2 against H1 : m1 ¹ m2 is Z=

mNs r ( mˆ 1 - mˆ 2 ) . s {1 + (m - 1)r }

Hence, we reject the hypothesis H 0 : m1 = m2 if the absolute value of Z is greater than z1-a /2 , where z1-a/2 is the 100(1 - a / 2) th percentile of a standard normal distribution. The required number of total clusters (N) to test the hypotheses H 0 : m1 = m2 versus H1 : m1 ¹ m2 with a two-sided type I error of a and power 1 - g is N=

s 2 {1 + (m - 1)r }(z1-a/2 + z1-g )2 . ms r2 ( m1 - m2 )2

(2.4)

The required number of clusters are Nr and N(1 - r ) for intervention and control groups, respectively. For a balanced design ( r = 0.5), the total number of clusters required is N=

4s 2 {1 + (m - 1)r }(z1-a/2 + z1-g )2 . m( m1 - m2 )2

As a special case of m = 1 in a balanced design, then the total number of subjects needed becomes

43

Cluster Randomized Trials

N subj = N = n1 + n2 =

4s 2 (z1-a/2 + z1-g )2 . ( m1 - m2 )2

As a special case of r = 0 in a balanced design, then the total number of subjects needed becomes N subj = mN =

4s 2 (z1-a/2 + z1-g )2 . ( m1 - m2 )2

The above statistical test and sample size estimate are for two-level data structures frequently in CRTs. The subjects nested within the groups would be considered as the frst level, and the organizational structure that forms the groups is the second level. Three-level data structures also arise frequently in real-world settings. One example of a trial with this structure is a large-scale pragmatic randomized trial to test information technology (IT) intervention to reduce hospitalization rates in patients with multiple chronic diseases. In this design, patients (level 1) are nested within clinics (level 2) which are nested within healthcare systems (level 3). Heo and Leon [21] provided the sample size formula for three-level cluster randomized trials.

2.3 Binary Outcomes Binary outcomes are frequently used in CRTs. For example, Abdulahi et al. [22] proposed a CRT with 1:1 allocation ratio. Zones in Ethiopia formed the unit of randomization for the trial, while mothers within the zones formed observation units. The trial’s primary objective is to determine if breastfeeding intervention is superior to usual care in improving the timely initiation of breastfeeding, exclusive breastfeeding, and growth. Abraham et al. [23] evaluated the effectiveness of two versions of a guideline and theory-based multicomponent intervention to reduce physical restraints in nursing homes. Nursing homes formed the unit of randomization for the trial, while residents within the nursing homes formed units of observation. The primary outcome was physical restraint prevalence at 12 months, assessed through direct observation by blinded investigators. Many CRTs aim to compare the proportions of individuals between intervention groups. The goal is to test H 0 : p1 = p2 against H1 : p1 ¹ p2 , where p1 and p2 are the proportions of the binary outcome in the intervention and control groups. In cluster-level analysis, a common approach is to test whether the difference in the average event rates is signifcantly different between intervention and control groups using a two-sample t-test, a Wilcoxon ranksum test, or a Fisher’s two-sample permutation test. Individual-level analysis

44

Design and Analysis of Pragmatic Trials

can be done using the adjusted chi-square test, ratio estimator approach, generalized linear mixed-effects model, and GEE approach. 2.3.1 Standard Pearson Chi-Square Test The Pearson chi-square test has been used to test the effect of intervention between intervention groups, assuming that sample observations are statistically independent. Let nk be the number of clusters in the kth intervention ( k = 1, 2 ). Let k = 1 and 2 denote the intervention and control groups, respectively. We assume that clusters are randomly assigned to one of two groups. Let Yijk be the binary random variable ( Yijk = 1 for a response, 0 for no response) of the jth ( j = 1, ˜ , mik ) subject in the ith ( i = 1,˜ , nk ) cluster of the kth intervention. Let mik be the cluster size of the ith cluster of the kth intervention, Mk =

å

nk

mik , and M =

i=1

å

2

Mk . Let Y++k =

k=1

å å nk

mik

i=1

j=1

Yijk .

The unbiased estimators of the response probabilities are given by nk

mik

ååY

ijk

pˆ k =

i=1 j=1 nk

=

åm

Y++k , Mk

k = 1, 2.

ik

i=1

An overall response probability is estimated by 2

nk

mik

ååå

2

Yijk

pˆ =

k=1 i=1 j=1 2 nk

++k

=

ååm

åY k=1

M

,

k = 1, 2.

ik

k=1 i=1

The Pearson chi-square test statistic is given by 2

c 2P =

å k=1

Mk ( pˆ k - pˆ )2 . pˆ (1 - pˆ )

The Pearson chi-square test assumes that observations are independent. In CRTs, the assumption of independence in the Pearson chi-square test is generally violated since the responses of subjects within a cluster are more similar than those taken on subjects from different clusters. Thus, it is more reasonable to assume that the intracluster correlation ( r ) is positive. The bias is given by the variance infation factor (VIF), {1 + (m - 1)r } , for clusters of fxed size m. Calculating a standard Pearson chi-square test statistic to test hypotheses and then dividing these statistics by the VIF produce an appropriate cluster-adjusted chi-square test statistic [1].

45

Cluster Randomized Trials

2.3.2 Adjusted Chi-Square Test Clustering can be quantifed by intracluster correlation ( r ), a measure of the similarity between individuals within a cluster for an outcome. Since r is unknown, the analysis of variance (ANOVA) estimate rˆ is used to estimate the intracluster correlation coeffcient for the binary outcome [1, 24]. An adjusted chi-square test statistic was proposed by making a simple adjustment to the Pearson chi-square test statistic to compare proportions estimated from clustered binary observations [25]. The correction factor is given by

Ck

å =

nk i=1

mik {1 + (mik - 1)rˆ }

å

nk i=1

mik

= 1 + (mAk - 1)rˆ ,

k = 1, 2.

The factor Ck can be considered as the estimated variance infation factor in group k. This adjustment approach assumes that C1 is not signifcantly different from C2 . The assumption will hold for experimental comparisons but could be violated for some observational comparisons. Jung et al. [26] and Ahn et al. [27] investigated the conditions under which the adjusted chisquare statistic is valid and examined its performance when these assumptions are violated. The simulation study showed that the adjusted chi-square statistic generally produced empirical type I errors close to the nominal level under the assumption of a common intracluster correlation coeffcient. Even if the intracluster correlations were different, the adjusted chi-square statistic performed well when the groups had equal numbers of clusters. The adjusted chi-square test statistic for testing H 0 : p1 = p2 against H1 : p1 ¹ p2 is given by 2

c 2A =

å k=1

Mk ( pˆ k - pˆ )2 , ˆ Ck pˆ (1 - p)

which follows a chi-square distribution with one degree of freedom. Hence, we reject the hypothesis H 0 : p1 = p2 if c 2A is greater than c12-a (1) , where c12-a (1) is the 100(1 - a ) th percentile of a chi-square distribution with one degree of freedom. A sample size formula was developed for a dichotomous outcome that accounts for varying cluster sizes since cluster sizes are not constant in many cluster randomization trials [6, 14, 18]. A correction factor was proposed to compensate for the loss of effciency and power due to varying cluster sizes in continuous and binary outcomes [28–31]. We wish to test the null hypothesis H 0 : p1 = p2 versus H1 : p1 ¹ p2 . The unbiased estimators of the response probabilities are given by

46

Design and Analysis of Pragmatic Trials

nk

mik

ååY

ijk

pˆ k =

i=1 j=1 nk

åm

, k = 1, 2.

ik

i=1

Conditional on the empirical distribution of the cluster size mik , the variance of pˆ k is estimated by

å V = pˆ (1 - pˆ ) k

k

k

nk i=1

mik {1 + (mik - 1)rˆ }

æ ç è

å

mik ö÷ i=1 ø nk

2

.

The test statistic for testing H 0 : p1 = p2 versus H1 : p1 ¹ p2 is Z=

pˆ 1 - pˆ 2 . Vˆ1 + Vˆ2

When the sample sizes n1 and n2 are suffciently large, the null hypothesis is rejected if the absolute value of Z is greater than z1-a /2 , the 100(1 - a / 2) percentile of the standard normal distribution. We discuss sample size calculation to test hypotheses H 0 : p1 = p2 versus H1 : p1 ¹ p2 with a two-sided type I error rate of a and power 1 - g . Let mik be independent and identically distributed with mean q and variance t2 . As nk ® ¥ , nk × Vk ® pk (1 - pk ){(1 - rˆ )

1 ˆ ˆ 2 + r + rx }, q

where x = t / q . With a two-sided signifcance level of a and power of 1 - g , the sample size for comparing two proportions is n = (z1-a /2 + z1-g )2 ×

p1(1 - p1 ) + p2 (1 - p2 ) 1 × {(1 - rˆ ) + rˆ + rˆx 2 }. 2 (p1 - p2 ) q

The above sample size estimate depends on the frst two moments of the marginal distribution of the cluster size, in addition to expected treatment difference, variance, intraclass correlation, and type I and type II error rates. In the case of unequal cluster size, a frequently used method for sample size estimate is to use the average cluster size. If we use average cluster size, that is, if we use q instead of mik , then the required number of clusters in each intervention group is

47

Cluster Randomized Trials

n = (z1-a /2 + z1-g )2 ×

p1(1 - p1 ) + p2 (1 - p2 ) 1 .×{(1 - rˆ ) + rˆ }. (p1 - p2 )2 q

Since rˆ is positive in most cases, the average cluster size method underestimates the sample size. The percentage increase in sample size ( Q ) from the average cluster size method for both continuous and binary outcomes [13, 18] is Q=

qrˆ x 2 ´100. 1 - rˆ + qrˆ

This sample size formula can be easily extended to accommodate unbalanced randomization. Suppose the intervention:control randomization ratio is r : 1 , the number of clusters required for the control group is + z )2 s 2 1 1 (z {(1 - rˆ ) + rˆ + rˆx 2 }. n2 = (1 + ) 1-a /2 1-g 2 (p1 - p2 ) q r The number of clusters required for the intervention group is n1 = r × n2 . It is noteworthy that when x = 0, i.e., no variability in cluster size so all mik = q, the required sample size becomes + z )2 s 2 1 1 (z {(1 - rˆ ) + rˆ }. n2* = (1 + ) 1-a /2 1-g 2 (p1 - p2 ) r q

2.3.3 Ratio Estimator Chi-Square Test Rao and Scott [32] proposed a simple method for comparing independent groups of clustered binary data, which is based on the concepts of “design effect” and “effective sample size” widely used in sample surveys. It assumes no specifc models for the intracluster correlation. The ratio estimator chisquare test [32] is a simple adjustment of the standard Pearson chi-square

å

nk

test. If the unit of allocation is an individual, VarB ( pˆ k ) = pˆ k (1 - pˆ k )/ mik . i=1 However, if the unit of allocation is a cluster, then VarB ( pˆ k ) tends to underestimate the true variance of pˆ k . The unbiased estimator of pˆ k is nk

mik

ååY

ijk

pˆ k =

i=1 j=1 nk

åm

ik

i=1

=

Y++k , Mk

k = 1, 2.

48

Design and Analysis of Pragmatic Trials

The unbiased estimator pˆ k can be written as the ratio of two-sample means, Y++k / nk and Mk / nk . Thus, for large nk , an estimator of the variance of pˆ k is obtained from the standard sample survey theory [33] as nk

VarR ( pˆ k ) = nk (nk - 1)-1(

nk

nk

æ ç ç è

2

ö Yijk - mik pˆ k ÷ = ÷ j=1 ø

mik

åm ) å å ik

-2

i=1

i=1

nk

åå i=1

2

æ mik ö ç Yijk - mik pˆ k ÷ ç j=1 ÷ è ø . (nk - 1)Mk2

As nk ® ¥ , Z = (pˆ k - pk )/ VarR ( pˆ k ) asymptotically follows a normal distribution with mean 0 and variance 1. The design effect is nk

VarR ( pˆ k ) dk = = VarB ( pˆ k )

åm Var (pˆ ) ik

i=1

R

k

pˆ k (1 - pˆ k )

=

MkVarR (pˆ k ) , pˆ k (1 - pˆ k )

which represents the variance infation due to clustering. In sample sur˜ k = Mk / dk is called “effective sample size”. The adjusted vey literature, M number of responses in the kth intervention is Y˜++k = Y++k / dk . Let 2 2 ˜ k = Y++k / Mk = pˆ k and p˜ = ˜ k . The one degree p˜ k = Y˜++k / M Y˜++k / M

å

k=1

of freedom ratio estimator chi-square statistic is 2

c 2R =

å k=1

˜ k ( p˜ k - p) ˜ 2 M = p˜ (1 - p˜ )

2

å k=1

å

k=1

˜ k ( pˆ k - p) ˜ 2 M . p˜ (1 - p˜ )

2 R

The ratio estimator statistic c is the same as the adjusted chi-square statis˜ and p˜ replacing M and p, respectively. The ratio estimator tic c 2A with M chi-square test does not require the assumption that the population design effects are the same in both groups. Thus, the ratio estimator is well-suited for both experimental and observational studies. 2.3.4 Generalized Estimating Equation Approach Let Yij be the binary response obtained from a subject j ( j = 1, ˜ , m ) in cluster i ( i = 1, ˜ , N ). We use ri = 1/0 to indicate that cluster i belongs to the intervention/control group, and r = E(ri ) is the proportion of clusters randomly assigned to the intervention group. We model Yij with the following logistic model: Yij  Bernoulli( pij ) and æ pij log çç è 1 - pij

ö ÷÷ = b1 + b 2ri , ø

j = 1, ˜ , m.

(2.5)

49

Cluster Randomized Trials

Here, b1 models the response on the log-odds scale for the control group, and b 2 is the log-odds ratio between intervention and control groups, representing the intervention effect. The primary interest is to test the null hypothesis H 0 : b 2 = 0 versus H1 : b 2 ¹ 0 . To facilitate derivation, we reparameterize Equation (2.5) as æ pij log çç è 1 - pij

ö ÷÷ = b1 + b2 (ri - r ), ø

(2.6)

where b1 = b1 + b 2 r and b2 = b 2 . Hence, testing b2 = 0 is equivalent to testing b 2 = 0. From Equation (2.6), we have pij =

e

bˆ’Zij

1+ e

bˆ’Zij

,

where b = (b1 , b2 )¢ and Zij = (1, ri - r )¢ . A GEE estimator bˆ is obtained by solving U n ( b) = 0 [20], where U n ( b) =

1 N

N

m

åå[y

ij

- pij (b)]Zij .

(2.7)

i=1 j=1

This equation is solved using the Newton–Raphson algorithm. At the lth iteration, bˆ ( l ) = bˆ ( l-1) + N -1/2 AN-1(bˆ (l-1) )U N (bˆ ( l-1) ), where N

AN ( b) = -N -1/2

m

åå i=1 j=1

¶U N ( b) 1 = ¶b N

N

m

æ1

ååp (1 - p ) èç r - r ij

ij

i=1 j=1

i

ri - r

ö ÷. (ri - r ) ø 2

The statistics Z = N (bˆ - b) approximately follows a normal distribution with mean 0 and variance V as N ® ¥ , and V can be consistently estimated ˆ , with by VN = AN-1( bˆ )S N ( bˆ )AN-1(b) 1 S N ( bˆ ) = N

N

æ ç ç è

åå i=1

Ä2

ö eˆ ij Zij ÷ . ÷ j=1 ø m

ˆ and c Ä2 = cc T for a vector c . We reject the null hypotheHere, eij = Yij - pij (b) sis H 0 : b2 = 0 if the absolute value of Nbˆ2 / sˆ 2 is greater than z1-a /2 , where sˆ 22 is the (2, 2)th element of VN and z1-a /2 is the 100( 1 - a /2)th percentile of the standard normal distribution.

50

Design and Analysis of Pragmatic Trials

We discuss sample size calculation to test hypothesis H 0 : b 2 = 0 versus H1 : b 2 = b 20 ¹ 0 with a two-sided type I error of a and power 1 - g . That is, we want to determine how many clusters are needed such that the trial can detect a treatment effect of b20 with a two-sided type I error of a and power 1 - g . Let A and S denote the limits of AN and S N , respectively. The true response rates in the control and intervention groups are p1 = e b1 /(1 + e b1 ) and p2 = e( b1 + b2 ) /(1 + e( b1 + b2 ) ) , respectively. Let q1 = 1 - p1 , q2 = 1 - p2 , and r jj¢ = corr(Yij , Yij¢ ) denote the within-subject correlation with rjj = 1 . As N ® ¥ , AN and S N converge to A and S , where -r ö æ1 2 ÷ + mrp2 q2 ç r ø è1- r

æ1 A = m(1 - r )p1q1 ç è -r

1- r ö ÷, (1 - r )2 ø

and m

S = (1 - r )p1q1

m

-r ö ÷ + rp2q2 r2 ø

æ1 r jj¢ ç è -r j¢=1

åå j=1

m

m

åår j=1 j¢=1

jj¢¢

æ1 ç è1- r

1- r

ö ÷, (1 - r ) ø 2

respectively. As N ® ¥ , VN converges to V = A -1SA -1 . The (2,2)th element of V = A -1SA -1 is m

s = 2 2

m

åår

t

jj¢

j=1 j¢=1 2 2 r 1 1 2 2

ms pqpq

=

t[1 + (m - 1)r ] , ms r2 p1q1 p2q2

where t = (1 - r )p1q1 + rp2q2 and s r2 = r (1 - r ) . Here, t is the pooled variance from the control and intervention groups. The required sample size for testing the hypothesis H 0 : b 2 = 0 versus H1 : b 2 = b 20 ¹ 0 is N=

s 22 (z1-a /2 + z1-g )2 t [1 + (m - 1)r ](z1-a /2 + z1-g )2 = . b 202 b 202 ms r2 p1q1 p2q2

Example: In a pragmatic IT-based colorectal cancer screening trial, physicians are randomly allocated to either a control (usual care) group or an IT-based intervention group. Based on the assignment of his or her physician, each patient will be assigned either to an IT-based intervention or to a control group. Physicians will be randomly allocated to either a control group or an IT-based intervention group at a 1:1 ratio. The primary outcome of the trial is participation in colorectal cancer testing (yes/no, 1 = participation in testing, 0 = nonparticipation). We anticipate about 20% of the control group will participate in testing based on pilot data. We assume that each physician recruits 23 patients (m = 23), and r = 0.02. We will assume that the IT-based intervention group will yield at least a

(2.8)

51

Cluster Randomized Trials

32% participation rate. We would like to estimate the number of clusters to test the hypothesis H 0 : p1 = p2 versus H1 : p1 ¹ p2 with a two-sided type I error of 0.05 and 90% power. That is, we want to determine how many clusters (here, physicians) are needed such that the trial can detect a treatment effect of 12% (= 32%−20%) with 80% power at a signifcance level of 0.05. Here, r = 0.5, p1 = 0.2, p2 = 0.32, s r2 = 0.25, and t = 0.189. Thus, b1 = log( p1 /(1 - p1 )) =-1.386, and b 2 = log( p2 /(1 - p2 )) - b1 = 0.633. From Equation (2.8),

N=

t [1 + (m - 1)r ](z1-a /2 + z1-g )2 = 36. b 202 ms r2 p1q1 p2q2

The number of clusters we need is N = 36 physicians. That is, 36 physicians will be randomly allocated to either a control group or an IT-based intervention group at a ratio of 1:1. 2.3.5 Generalized Linear Mixed Model Approach Let Yij be the binary response obtained from a subject j ( j = 1, ˜ , m ) in cluster i ( i = 1, ˜ , N ). We use ri = 1/0 to indicate that cluster i belongs to the intervention/control group, and r = E(ri ) is the proportion of subjects randomly assigned to the intervention group. We model Yij with a mixed-effect logistic regression model [4], æ pij log çç è 1 - pij

ö ÷÷ = b1 + b 2ri + ui + eij , ø

j = 1, ˜ , m.

(2.9)

Here, b1 models the response on the log-odds scale for the control group, b 2 is the log-odds ratio between intervention and control groups, representing the intervention effect, and pij = E(Yij ) is the probability of the response of the jth subject in the ith cluster. A random effect term for the ith cluster, ui , is distributed as N(0, s u2 ) , and a random effect for the jth subject in the ith cluster, eij , is distributed as N(0, s e2 ) . The null hypothesis to be tested is H 0 : b 2 = 0 in Equation (2.9). Sample size estimation for a trial with binary outcomes is essentially an extension of the method used for continuous outcomes. The sample size formula (2.4) for continuous outcomes was extended to determine the number of clusters, N, for testing H 0 : b 2 = 0 in Equation (2.3) to binary outcomes [4]. The required number of clusters (N) for a two-sided type I error of a and power of 1 - g is given by [34]: N=

1 [1 + (m - 1)r ]L2 , 1 - r m( p1 - p2 )2

52

Design and Analysis of Pragmatic Trials

where L = Z1-a/2 p(1 - p )/ r + Z1-g p1(1 - p1 ) + p2 (1 - p2 )(1 - r )/ r , and p = (1 - r )p1 + rp2 . The required numbers of clusters are N(1 - r ) and Nr for control and intervention groups, respectively. A positive intracluster correlation ( r ) increases the required sample size for a given power. For a balanced design ( r = 0.5), L becomes L = Z1-a /2 2p(1 - p ) + Z1-g p1(1 - p1 ) + p2 (1 - p2 ). Thus, the required number of total clusters is

N=

(

2[1 + (m - 1)r ] Z1-a /2 2p(1 - p ) + Z1-g p1(1 - p1 ) + p2 (1 - p2 ) m( p1 - p2 )2

)

2

. (2.10)

Example: In a pragmatic IT-based colorectal cancer screening trial, mixed-effects logistic regression models will be ftted to identify signifcant factors associated with screening with group assignment as the primary independent variable and clinic physician as a random cluster effect. We apply the sample size formula (2.10) to estimate the sample size using the balanced design with the Cancer Risk Intake System (CRIS) example in Section 2.3.4. Here, p1 = 0.2, p2 = 0.32, m = 23, r = 0.02, p = 0.26, a = 0.05, and power = 0.9. The required total number of clusters (N) is 36 physicians, with 18 physicians allocated to each of two study groups.

2.4 Count Outcomes Count outcomes frequently occur in CRTs that have been widely employed in medical and public health research. Many clinical outcomes are count outcomes, such as the number of falls in nursing homes, the number of clinic visits, and the number of hospitalizations. For example, Fairall et al. [35] conducted a pragmatic cluster randomized trial to evaluate an educational intervention aimed at improving the management of lung disease in adults attending primary care clinics with the number of clinic visits as an outcome. The unit of randomization was the clinic with the outcome data collected from individual patients. In particular, count outcomes are often used in infectious disease trials [36, 37]. Non-Bayesian approaches to the analysis of count outcomes are considered [38, 39]. For example, Young et al. [38] investigated the special case of correlated Poisson responses where exact relationships are available for a loglinear model with normal random effects. Generalized linear mixed model

53

Cluster Randomized Trials

(GLMM) and generalized estimating equation (GEE) have been frequently used for the analysis of CRTs. Bayesian methods were used for the analysis of count outcomes in CRTs [40]. Amatya et al. [41] proposed a sample size calculation method for CRTs with a count outcome based on the Poisson regression model. The model assumes the number of subjects (cluster size) to be equal across all clusters. Equal cluster sizes might be too restrictive, especially for pragmatic CRTs in real-world settings. The cluster sizes naturally vary due to practice scale, patient base, and logistics. Researchers have shown that ignoring cluster size variability in sample size calculation can lead to underpowered studies [4, 42, 43]. Wang et al. [44] proposed incorporating cluster size variability into sample size calculation for CRTs with count outcomes, where a correction term based on the coeffcient of variation in cluster size is included [13]. 2.4.1 Adjusted Normality Test A frequently used approach for the analysis of count data is to ascertain the number of events that occur relative to the total follow-up time. That is, an incidence rate that accounts for the variable follow-up time is frequently used if subjects are followed for varying lengths of time. Let l1 and l2 be the incidence rates in the intervention and control groups, respectively. The primary aim is to compare the incidence rates between the intervention and control groups. That is, the aim is to test the null hypothesis H 0 : l1 = l2 versus the alternative hypothesis H1 : l1 ¹ l2 . For simplicity, we assume that each subject is followed for the same amount of time with balanced randomization, i.e., n1 = n2 = n . Let Yijk denote the count outcome of the jth ( j = 1, ˜ , mik ) subject in the ith ( i = 1, ˜ , nk ) cluster of the kth intervention, where mik is the number of subjects in the ith cluster of the kth intervention, and nk is the number of clusters in the kth intervention. We assume a Poisson model for Yijk with parameter l k = E(Yijk ) =Var (Yijk ) . Yijk ’s are assumed to have a common intracluster correlation coeffcient, r =Corr (Yijk , Yij¢k ) for j ¹ j¢ [45, 46]. An unbiased estimator of l k can be expressed as mik

n

ååY

ijk

lˆ k =

i=1 j=1 n

.

åm

ik

i=1

Conditional on mik , lˆ k approximately follows a normal distribution with mean l k and variance

(

)

Var lˆ k |mik =

lk

å

n i=1

æ ç è

mik éë1 + ( mik - 1) r ùû

å

mik ÷ö i=1 ø n

2

.

54

Design and Analysis of Pragmatic Trials

The variance of lˆ k can be estimated by

2 k

s =

lˆ k

å

n i=1

æ ç è

mik éë1 + ( mik - 1) rˆ ùû

å

mik ÷ö i=1 ø n

2

.

We assume mik is independently and identically distributed with mean q = E(mik ) and variance t2 =Var (mik ) . Let x = t /q be the coeffcient of variation of cluster sizes. The variance of lˆ k is 1 l n[(1 - r )E(mik ) + r E(mik2 )] l k é Var(lˆ k ) = E é Var(lˆ k |mik )ù = k = (1 - r ) + r + rx2 ùú . ë û (nq )2 n êë q û

With a two-sided signifcance level of a and power of 1 - g , the required number of clusters in each intervention group is n=

2 ( z1-a /2 + z1-g ) ( l1 + l2 ) é 1 - r 1 + r + rx 2 ù . ) 2 êë( úû q ( l1 - l2 )

å å 2

(2.11)

n

The total number of subjects is M = mik and the total number of k=1 i=1 clusters is n1 + n2 = 2n. This sample size formula can be easily extended to accommodate unbalanced randomization. Suppose the intervention:control randomization ratio is r : 1 . Then, the number of clusters required for the control group is n=

2 ( z1-a /2 + z1-g ) ( l1 / r + l2 ) é 1 - r 1 + r + rx 2 ù . ) 2 êë( q ûú ( l1 - l2 )

(2.12)

It is noteworthy that when x = 0 , i.e., no variability in cluster size so all mik = q , the sample size formula becomes n

*

2 z1-a /2 + z1-g ) ( l1 / r + l2 ) é ( 1 = (1 - r ) + r ùú . 2 ê q ë û ( l1 - l2 )

Example: To illustrate the application of the proposed sample size estimation method in CRTs with count outcomes, we consider the CRT example provided in Amatya et al. [41]. The investigators were interested in assessing an educational intervention to improve lung disease management in adults attending South African primary care clinics. Forty clinics were randomized into intervention or control groups. The goal of enrollment at each clinic was 50 patients, and the outcome of interest was the number of clinic visits during the four-month study period. The

55

Cluster Randomized Trials

analysis reported that lˆ1 = e 1 = 4.35, lˆ2 = e 1 2 = 3.63, and rˆ = 0.32. Suppose we want to design a new balanced cluster randomization trial based on the estimated design parameters to achieve 90% power at a twosided signifcance level of 5%. Suppose there is no variability in cluster size (every clinic enrolls exactly 50 patients). In that case, the required number of clusters is 54 in each group with x = 0 from Equation (2.11) in the adjusted normality test. If the variability in cluster sizes is relatively small, say following a uniform distribution, U[40, 60], the required number of clusters is 55. Under a more considerable variability, say U[25, 75], the required number of clusters is 59. Finally, if more patients are expected to be enrolled at each clinic where the cluster size follows a distribution of U[70, 130], the required number of clusters is 55. bˆ

bˆ + bˆ

2.4.2 Ratio Estimator Method The aim is to test if the rate ratio, RR = l1 / l2 , between intervention and control groups is signifcantly different from 1. That is, the aim is to test H 0 : RR = 1 versus H1 : RR ¹ 1 . Statistical inferences for the observed rate ˆ ) can be obtained using the theory of ratio estimation [32]. The ratio ( RR mik mik observed rate is lˆik = Yijk / tijk , where tijk is the follow-up time of

å

j=1

å

j=1

the jth subject of the ith cluster in the kth intervention group. Let tik = and tk =

lˆk =

å

å

nk

nk

å

mik

j=1

tijk

tik . The mean observed rate in the kth intervention group is

i=1

ˆ = lˆ1 / lˆ2 . lˆik / nk . The observed rate ratio can be computed as RR

i=1

The estimated variance of the observed rate for the kth intervention group is given by [32, 47] ˆ lˆ k ) = nk Var( nk - 1

nk

tik2 ˆ (l ik - lˆ k )2 . 2

åt i=1

(2.13)

k

ˆ ) can be estimated by The variance of log(RR ˆ [log(RR ˆ )] = Var

2

å k=1

ˆ lˆ k ) Var( . lˆ 2k

The rate ratio estimator chi-square statistic for testing H 0 : RR = 1 is given by c R2 =

ˆ - 0]2 [log(RR) , ˆ ˆ Var[log(RR)]

(2.14)

which follows a chi-square distribution with one degree of freedom. We 2 reject the hypothesis H 0 : RR = 1 if c 2R is greater than c12-a (1) , where c1a (1)

56

Design and Analysis of Pragmatic Trials

is the 100(1 - a) th percentile of a chi-square distribution with one degree of ˆ is freedom. The 95% confdence interval for RR

(

) (

)

æ exp log(RR) ö. ˆ - 1.96 Var[log(RR ˆ )]) , exp log(RR ˆ ) + 1.96 Var[log(RR)] Rˆ ç ÷ è ø

2.4.3 Generalized Estimating Equation Suppose that N clusters are randomized to the control or intervention group in a CRT. Let mi (i = 1,˜, N ) denote the cluster sizes in the ith cluster with mean q = E(mi ) and variance t 2 = Var(mi ). Let Yij be the count outcome measured on the jth subject from the ith cluster. We assume that Yij follows a Poisson distribution with mean l ij and the correlation structure is exchangeable since all subjects receive the same intervention within each cluster. We defne r = Corr(Yij , Yij¢ ) for j ¹ j¢ and r = 1 for j = j¢. Let ri = 0/1 denote that the ith cluster is randomized to the control/intervention group. A cluster is randomly assigned to the intervention group with probability r = E(ri ) . We assume log(l ij ) = b1 + b 2ri ,

(2.15)

where b 2 is the difference in marginal mean between the intervention and control groups on the log scale, representing the overall intervention effect. Equation (2.15) suggests that lij = li with li = exp( b1 + b 2ri ) . The hypotheses of interest are H 0 : b 2 = 0 vs H1 : b 2 ¹ 0 . Let Zij = (1, ri )¢ = Zi , b = ( b1 , b 2 )¢ , and Yi = (Yi1 , ˜ , Yimi )¢ be the cluster-specifc response vector with mean li ( b ) = [li1( b ), ˜ limi ( b )]¢ , where li ( b ) = exp(Zi¢b ). If cluster i is assigned to a control group, li = exp( b1 ). Otherwise, li = exp( b1 + b 2 ) . Utilizing the independent working correlation, the GEE estimator bˆ = ( bˆ1 , bˆ2 )¢ solves equation SN ( b ) , where N

SN ( b ) = N -1/2

åD W i¢

i

-1

[Yi - li ( b )] = 0,

(2.16)

i=1

¶li ( b ) is a mi x 2 matrix, and Wi is a mi x mi ¶b diagonal matrix with all diagonal elements being li ( b ) . Equation (2.16) can be solved using the Newton–Raphson algorithm. At the lth iteration, where li ( b ) = exp(Zi¢b ) , Di =

bˆ ( l ) = bˆ ( l -1) + N -1/2 AN-1( bˆ (l-1) )SN ( bˆ ( l-1) ), where

57

Cluster Randomized Trials

AN ( bˆ ) = N -1

mi

N

ååZ Z l (bˆ ). i

i¢ i

i=1 j=1

N ( bˆ - b ) approximately follows a normal distribution with mean 0 and variance S N = AN-1VN AN-1 [20], where VN ( bˆ ) = N -1

N

mi

mi

åååe e Z Z , ij ij¢

i



i=1 j=1 j¢=1

where e ij = Yij - exp(Z¢i bˆ ) . Let sˆ 22 be the (2,2)th element of S N . We reject H 0 : b 2 = 0 if the absolute value of N bˆ2 / sˆ 2 is greater than z1-a /2 , where z1-a /2 is the 100(1 - a / 2) th percentile of the standard normal distribution. Let A and V denote the limits of AN and VN as N ® ¥. Then, S N converges to S = A -1VA -1 . Let s 22 be the (2,2)th element of S . Given the true intervention effect b 2 = b 20 , the required number of total clusters to achieve a power 1 - g at a two-sided signifcance level of a is N=

s 22 (z1-a /2 + z1-g )2 . b 202

(2.17)

As N ® ¥ , AN converges to æ1 A = (1 - r )l1q ç è0

0ö æ1 ÷ + r l2q ç 0ø è1

1ö ÷. 1ø

As N ® ¥ , VN converges to æ V = Eç ç è

mi

mi

ååe e

ij ij¢

j=1 j¢=1

æ1 ç è ri

ri ö ö ÷÷. ri2 ø ÷ ø

With matrix algebra, the (2,2)th element of S = A -1VA -1 is

s 22 =

=

q + (q 2 + t 2 - q )r q + (q 2 + t 2 - q )r + r l2q 2 (1 - r )l1q 2

q + (q 2 + t 2 - q )r æ 1 1 + ç q2 r l r l2 (1 ) 1 è

ö ÷. ø

The closed-form sample size formula for the number of clusters is

58

Design and Analysis of Pragmatic Trials

é q + (q 2 + t 2 - q )r æ 1 1 öù 2 + ê ç ÷ú (z1-a/2 + z1-g ) 2 (1 ) q l l r r 1 2 è øû . N=ë b 202

(2.18)

For a fxed cluster size, q = m and t 2 = 0. The required total number of clusters is [1 + (m - 1)r ] æ 1 1 ö 2 + ÷ (z1-a/2 + z1- g ) ç m r (1 r ) l l 1 2 ø è . N= b 202

(2.19)

In summary, to compute a sample size using Equation (2.18), we need to specify the mean (q ) and variance ( t 2 ) of cluster sizes, the randomization probability r , the regression parameters ( b ), the ICC parameter ( r ), the detectable difference ( b 20 ), and pre-determined levels of type I error a and power 1 - g . Furthermore, the proposed sample size formulas accommodate IRTs as a special case, where every cluster is of size 1. This implies that q = 1 1 1 and t = 0. It is then straightforward to show that s 22 = + and r l r l2 (1 ) 1 the corresponding sample size estimate is æ 1 1 ö 2 + ÷ (z1-a/2 + z1-g ) ç (1 - r )l1 r l2 ø è N= . b 202

(2.20)

Li et al. [48] developed a closed-form sample size formula for CRTs with count outcomes. The pragmatic features incorporated into a sample size formula include arbitrary randomization ratio, overdispersion, random variability in cluster size, and unequal lengths of follow-up over which the count outcome is measured. Many clinical count outcomes, such as the number of clinic visits, exhibit excessive zero values. In the presence of zero infation, traditional sample size formulas for count data based on Poisson or negative binomial distribution may be inadequate. Zhou et al. [49] provided the sample size method for CRTs with zero-infated count outcomes using GEE. Li and Tong [50] developed closed-form sample size formulas for CRTs with count outcomes when the counts are right truncated. Example: We illustrate the sample size estimation using a GEE method with the example given in Section 4.1. If there is no variability in cluster size, a GEE method yields 55 (=[N/2]) clusters per group using Equation (2.18). Here, [x] is the least integer greater than or equal to x. The numbers of estimated clusters in each group are 56, 60, and 56 for U[40, 60], U[25, 75], and U[70, 130], respectively, when the GEE method is used for sample size estimation.

59

Cluster Randomized Trials

2.5 Cluster Size Determination for a Fixed Number of Clusters The number of available clusters is often limited in a CRT. In addition, the number of clusters cannot exceed a specifed maximum value due to practical considerations such as the fxed expense of maintaining staff to collect data and conduct an intervention in each cluster. For example, in GoodNEWS (Genes, Nutrition, Exercise, Wellness, and Spiritual Growth) cluster randomization trial [51], the number of clusters cannot exceed a specifed maximum value due to practical considerations such as the fxed expense of maintaining staff to collect data and conduct an intervention in each cluster. The sample size formula for the number of subjects per cluster was provided assuming the same number of subjects in each cluster (equal cluster size) when there is an upper limit in the number of clusters to be studied in a CRT [1, 52, 53]. For a fxed number of clusters per group, the power does not increase appreciably once the number of subjects per cluster exceeds 1/ r [54]. An increase in the number of subjects beyond 1/ r will tend to have only a modest effect on power. If r = 0.04, it is not worth enrolling more than 25 (= 1/ r ) subjects per cluster. Let nI be the number of subjects per group needed in IRTs. That is, nI is the number of clusters per intervention group that would be required if the cluster size (m) is equal to one. In CRTs with an equal cluster size (m), the number of required subjects per group can be obtained by multiplying nI by the variance infation factor [1 + (m - 1)r ] , where m is the cluster size. Let k be the number of clusters per group. Then the total number of subjects in each group is km, which is also nI [1 + (m - 1)r ] . Thus, the number of clusters in each group (k) can be written as k = nI [1 + (m - 1)r ]/ m . When the number of clusters per each intervention group is fxed by k * , the required number of subjects per cluster is m* =

(1 - r )nI . k * - r nI

(2.21)

The total number of subjects required for each intervention group is obtained by multiplying k * by m* . The number of subjects ( m* ) is obtained only when the denominator of Equation (2.21), k * - rnI is positive. That is, m* can be computed only when k * is greater than r nI . That is, for trials with a fxed number of clusters with equal cluster size, the CRT will be feasible if the number of clusters ( k * ) is greater than the product of the number of subjects required under IRTs and the estimated intracluster correlation ( r ). When the cluster size is the same across the clusters, the variance under CRTs will be infated by a factor of [ 1 + (m - 1)r ] from that under IRTs. Thus, the detectable difference ( dc ) under CRTs is given by the square root of the variance infation factor times the detectable difference ( di ) under IRTs. That is, dc = di 1 + (m - 1)r .

60

Design and Analysis of Pragmatic Trials

1 ˆ ˆ 2 + r + rx ] for q unequal cluster sizes. Here, q is the mean cluster size, and x is the coeffcient of variation of cluster sizes. Suppose we have the estimate for the coeffcient of variation of cluster sizes from a pilot trial or from literature. In that case, we can estimate the average cluster sizes when the number of clusters is fxed. For varying cluster sizes, the total number of subjects required in each intervention group is Equation (2.2) shows the variance infation factor [ (1 - rˆ )

k *q = nIq {(1 - rˆ )

1 ˆ ˆ 2 + r + rx }. q

From the above equation, the average number of subjects per cluster ( m ) is [53] m=

(1 - r )nI . k * - r nI (x 2 + 1)

If there is no variation in cluster sizes ( x = 0), the above equation becomes Equation (2.21). For unequal cluster sizes, the average number of subjects is obtained only when the denominator of Equation (2.5) is positive. That is, m can be computed only when k * is greater than r nI (x 2 + 1) . Since the number of clusters is usually small in community intervention trials, the use of critical values z1-a /2 and z1-g instead of critical values t1-a /2 and t1-g corresponding to the t-distribution underestimates the required sample size. Therefore, to adjust for underestimating the required sample size, one cluster per group may be added when the sample size is determined with a 5% signifcance level, and two clusters per group with a 1% level of signifcance [17]. Example: The investigators plan to conduct a trial in Hispanic communities in Dallas. From the example in Section 2.4.1, lˆ1 = 4.35 and lˆ2 = 3.63. Under an IRT, the required number of subjects to detect a difference in the rates between two groups is 162 subjects per group from Equation (2.11) with a type I error rate of 5% and a power of 90%. To account for the clustering of individuals within a primary care clinic, we assume r = 0.05. The number of available primary care clinics is limited to 20 per group ( k * = 20). For an equal cluster size design, the required number of subjects per clus(1 - r )nI ter is m* = * = 153.9/11.9 = 12.9. That is, we need 13 subjects in each k - r nI clinic. The total number of subjects required for each intervention group is 260. If the cluster size varies among clusters with the coeffcient of variation equal to 0.45 ( x = 0.45), the required number of subjects per cluster is (1 - r )nI m* = * = 153.9/10.3 = 15. That is, we need 15 subjects in each k - r nI (x 2 + 1) cluster. The total number of subjects required for each intervention group is 300.

Cluster Randomized Trials

61

References 1. A. Donner, and N. Klar. Design and Analysis of Cluster Randomization Trials in Health Research. Oxford University Press, 2000. 2. S. Eldridge, and S. Kerry. RA Practical Guide to Cluster Randomised Trials in Health Services Research. UK Wiley, 2012. 3. D.M. Murray. Design and Analysis of Group-Randomized Trials. Oxford University Press, 1998. 4. C. Ahn, M. Heo, and S. Zhang. Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research. CRC Press, 2014. 5. P.M. Fayers, M.S. Jordhoy, and S. Kaasa. Cluster-randomized trials. Palliative Medicine, 16(1):69–70, 2002. 6. S.M. Eldridge, D. Ashby, and S. Kerry. Sample size for cluster randomized trials: Effect of coeffcient of variation of cluster size and analysis method. International Journal of Epidemiology, 35(5):1292–1300, 2006. 7. J. Cornfeld. Randomization by group: A formal analysis. American Journal of Epidemiology, 108(2):100–102, 1978. 8. R.L. Wears. Advanced statistics: Statistical methods for analyzing cluster and cluster-randomized data. Academic Emergency Medicine, 9(4):330–341, 2002. 9. C.C. Quinn, M.D. Shardell, E.A. Terrin, M.L. Barr, S.H. Ballew, and A.L. GruberBaldini. Cluster-randomized trial of a mobile phone personalized behavioral intervention for blood glucose control. Diabetes Care, 34(9):1934–1942, 2011. 10. J. Frost, J.A. Lane, N. Cotterill, M. Fader, L. Hackshaw-McGeagh, H. Hashim, et al. Treating urinary symptoms in men in primary healthcare using non-pharmacological and non-surgical interventions (TRIUMPH) compared with usual care: Study protocol for a cluster randomised controlled trial. Trials, 20(1):546, 2019. 11. R.G. Victor, K. Lynch, N. Li, C. Blyler, E. Muhammad, J. Handler, et al. A cluster-randomized trial of blood-pressure reduction in black barbershops. New England Journal of Medicine, 378(14):1291–1301, 2018. 12. A. Donner, and N. Klar. Methods for comparing event rates in intervention studies when the unit of allocation is a cluster. American Journal of Epidemiology, 140(3):279–289, 1994. 13. A.K. Manatunga, M.G. Hudgens, and S. Chen. Sample size estimation in cluster randomized studies with varying cluster size. Biometrical Journal, 43(1):75–86, 2001. 14. G.J. van Breukelen, and M.J. Candel. Comment on effciency loss because of varying cluster size in cluster randomized trials is smaller than literature suggests. Statistics in Medicine, 31(4):397–400, 2001. 15. A. Donner, and J. Koval. A procedure for generating group sizes from a oneway classifcation with a specifed degree of imbalance. Biometrical Journal, 29(2):181–187, 1987. 16. O.J. Dunn, and V.A. Clark. Applied Statistics: Analysis of Variance and Regression, 2nd edition. John Wiley, 1987. 17. G.W. Snedecor, and W.G. Cochran. Statistical Methods, 8th edition. Iowa State University Press, 1989. 18. S.H. Kang, C. Ahn, and S.H. Jung. Sample size calculations for dichotomous outcomes in cluster randomization trials with varying cluster size. Drug Information Journal, 37(1):109–114, 2003.

62

Design and Analysis of Pragmatic Trials

19. S. Shea, R.S. Weinstock, J.A. Teresi, W. Palmas, J. Starren, J.J. Cimino, et al. A randomized trial comparing telemedicine case management with usual care in older, ethnically diverse, medically underserved patients with diabetes mellitus. Journal of the American Medical Informatics Association JAMIA, 13(1):40–51, 2006. 20. K. Liang, and S.L. Zeger. Longitudinal data analysis for discrete and continuous outcomes using generalized linear models. Biometrika, 84:3–32, 1986. 21. M. Heo, and A.C. Leon. Statistical power and sample size requirements for three-level hierarchical cluster randomized trials. Biometrics, 64(4):1256–1262, 2008. 22. M. Abdulahi, A. Fretheim, and J.H. Magnus. Effect of breastfeeding education and support intervention (bfesi) versus routine care on timely initiation and exclusive breastfeeding in southwest Ethiopia: Study protocol for a cluster randomized controlled trial. BMC Pediatrics, 18(1):313, 2018. 23. J. Abraham, R. Kupfer, A. Behncke, B. Berger-Höger, A. Icks, B. Haastert, et al. Implementation of a multicomponent intervention to prevent physical restraints in nursing homes (imprint): A pragmatic cluster randomized controlled trial. International Journal of Nursing Studies, 96:27–34, 2019. 24. M.S. Ridout, C.G.B. Demuetrio, and D. Firth. Estimating intraclass correlation for binary data. Biometrics, 55(1):137–148, 1999. 25. A. Donner, and A. Donald. The statistical analysis of multiple binary measurements. Journal of Chronic Diseases, 41:899–905, 1988. 26. S.H. Jung, C. Ahn, and A. Donner. Evaluation of an adjusted chi-square statistic as applied to observational studies involving clustered binary data. Statistics in Medicine, 20(14):2149–2161, 2001. 27. C. Ahn, S.H. Jung, and A. Donner. Application of an adjusted chi-square statistic to site-specifc data in observational dental studies. Journal of Clinical Peridontology, 29(1):79–82, 2002. 28. G. van Breukelen, and M. Candel. Comment on effciency loss because of varying cluster size in cluster randomized trials is smaller than literature suggests. Statistics in Medicine, 31(4):397–400, 2012. 29. G. van Breukelen, M. Candel, and M. Berger. Relative effciency of unequal versus equal cluster sizes in cluster randomized and multicentre trials. Statistics in Medicine, 26(13):2589–2603, 2007. 30. G. van Breukelen, M. Candel, and M. Berger. Relative effciency of unequal cluster sizes for variance component estimation in cluster randomized and multicentre trials. Statistical Methods in Medical Research, 17(4):439–458, 2008. 31. G. van Breukelen, and M. Candel. Effciency loss due to varying cluster sizes in cluster randomized trials and how to compensate for it: Comment on You et al. (2011). Clinical Trials, 9(1):125–125, 2012. 32. J.N.K. Rao, and A.J. Scott. A simple method for the analysis of clustered binary data. Biometrics, 48(2):577–585, 1992. 33. W.G. Cochran. Sampling Techniques, 3rd edition. Wiley, 1977. 34. P.J. Diggle, P. Heagerty, K.Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data, 2nd edition. Oxford University Press, 2002. 35. L.R. Fairall, M. Zwarenstein, E.D. Bateman, M. Bachmann, C. Lombard, B.P. Majara, et al. Effect of educational outreach to nurses on tuberculosis case detection and primary care of respiratory illness: Pragmatic cluster randomised controlled trial. BMJ, 331(7519):750–754, 2005.

Cluster Randomized Trials

63

36. P. Kelly, M. Katubulushi, J. Todd, R. Banda, V. Yambayamba, M. Fwoloshi, et al. Micronutrient supplementation has limited effects on intestinal infectious disease and mortality in a Zambian population of mixed hiv status: A cluster randomized trial. American Journal of Clinical Nutrition, 88(4):1010–1017, 2008. 37. T.J. Sandora, M.C. Shih, and D.A. Goldmann. Reducing absenteeism from gastrointestinal and respiratory illness in elementary school students: A randomized, controlled trial of an infection-control intervention. Pediatrics, 121(6):e1555–e1562, 2008. 38. M.L. Young, J.S. Preisser, B.F. Qaqish, and M. Wolfson. Comparison of subjectspecifc and population averaged models for count data from cluster-unit intervention trials. Statistical Methods in Medical Research, 16(2):167–184, 2007. 39. S. Bennett, T. Parpia, R. Hayes, and S. Cousens. Methods for the analysis of incidence rates in cluster randomized trials. International Journal of Epidemiology, 31(4):839–846, 2002. 40. D.J. Spiegelhalter. Bayesian methods for cluster randomized trials with continuous responses. Statistics in Medicine, 20(3):435–452, 2001. 41. A. Amatya, D. Bhaumik, and R.D. Gibbons. Sample size determination for clustered count data. Statistics in Medicine, 32(24):4162–4179, 2013. 42. L. Guittet, P. Ravaud, and B. Giraudeau. Planning a cluster randomized trial with unequal cluster sizes: Practical issues involving continuous outcomes. BMC Medical Research Methodology, 6(1):1, 2006. 43. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for stratifed cluster randomization design with binary outcomes and varying cluster size. Statistics in Medicine, 38(18):3395–3404, 2019. 44. J. Wang, S. Zhang, and C. Ahn. Sample size calculation for count outcomes in cluster randomization trials with varying cluster sizes. Communications in Statistics-Theory and Methods, 49(1):116–124, 2020. 45. C. Rutterford, A. Copas, and S. Eldridge. Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44(3):1057– 1067, 2015. 46. G. Zou, and A. Donner. Confdence interval estimation of the intraclass correlation coeffcient for binary outcome data. Biometrics, 60(3):807–811, 2004. 47. C. Ahn, and J. Lee. A computer program for the analysis of over-dispersed counts and proportions. Comouter Methods and Programs in Biomedicine, 52(3):195–202, 1997. 48. D. Li, S. Zhang, and J. Cao. Incorporating pragmatic features into power analysis for cluster randomized trials with a count outcome. Statistics in Medicine, 39(27):4037–4050, 2020. 49. Z. Zhou, D. Li, and S. Zhang. Sample size calculation for cluster randomized trials with zero-infated count outcomes. Statistics in Medicine, 41(12):2191–2204, 2022. 50. F. Li, and G. Tong. Sample size and power considerations for cluster randomized trials with count outcomes subject to right truncation. Biometrical Journal, 63(5):1052–1071, 2021. 51. M.J. DeHaven, M.A. Ramos-Roman, N. Gimpel, J. Carson, J. DeLemos, S. Pickens, et al. The goodnews (genes, nutrition, exercise, wellness, and spiritual growth) trial: A community-based participatory research (CBPR) trial with African-American church congregations for reducing cardiovascular disease risk factors — Recruitment, measurement, and randomization. Contemporary Clinical Trials, 32(5):630–640, 2011.

64

Design and Analysis of Pragmatic Trials

52. C. Ahn, F. Hu, C.S. Skinner, and D. Ahn. Effect of imbalance and intracluster correlation coeffcient in cluster randomized trials with binary outcomes when the available number of clusters is fxed in advance. Contemporary Clinical Trials, 30(4):317–320, 2009. 53. K. Hemming, A.J. Girling, A.J. Sitch, J. Marsh, and R.J. Lilford. Sample size calculations for cluster randomised controlled trials with a fxed number of clusters. BMC Medical Research Methodology, 11:102, 2011. 54. A. Donner, and N. Klar. Pitfalls of and controversies in cluster randomization trials. American Journal of Public Health, 94(3):416–422, 2004.

3 Matched-Pair Cluster Randomized Design for Pragmatic Studies

3.1 Introduction of Matched-Pair Cluster Randomized Design For cluster randomized trials (CRTs), a practical limitation of simple cluster randomized design is imbalances in baseline characteristics (covariates) across treatment groups. Covariate imbalance leads to higher variance in the intervention effect and loss of effciency and would threaten the internal validity of the trials. This is because CRTs in general recruit a smaller number of clusters that are heterogeneous in baseline covariates, different from individually randomized trials. As a result, several design and analysis methods to account for baseline covariate imbalance, including targeted individual or cluster selection and analysis of covariate adjustments, have been proposed [1, 2]. Allocation methods such as matching, stratifcation, and covariateconstrained randomization can be adopted to mitigate the problem of baseline covariate imbalance [3]. Specifcally, matched-pair randomized design is widely implemented in CRTs to improve the similarity in clusters across treatment groups. In matched-pair CRTs, clusters are matched or paired on basis of characteristics such as cluster size, geographic location, or other features potentially related to the primary outcome measure. One cluster from a pair is allocated to the experimental treatment and the other to the control. One example of implementation of matched-pair cluster randomized design is a study on the effectiveness of an intervention package in reducing perinatal mortality, using pairs of control and intervention clinics [4]. A study evaluating the Mexican Universal Health Insurance Program showed that matched-pair cluster randomized design improved bias, effciency, power, and robustness in CRTs [5, 6]. This chapter considers matched-pair cluster randomized design, where matching can increase power by reducing study population heterogeneity and can guarantee balance on selected confounders by matching on them. A unique feature of CRTs is that outcomes within the same cluster are more similar than those from different clusters. To quantify the similarity within

DOI: 10.1201/9781003126010-3

65

66

Design and Analysis of Pragmatic Trials

a cluster, the intraclass correlation coeffcient is used to measure the correlation between any two individuals in the same cluster [7]. In matched-pair CRTs, there are three types of intraclass correlations we should consider: the within-treatment, the inter-treatment, and the within-subject correlations [8, 9]. When matching only occurs at the cluster level as in most of the matchedpair CRTs, the within-subject correlation is no longer required, because different individuals from the matched clusters are assigned to different treatments. The intraclass correlations must be accounted for in sample size planning and subsequent data analysis. Failure to do so may result in infated type I errors, leading to invalid conclusions. Another important consideration regarding the design and analysis of CRTs is how to adequately handle missing data that occur commonly in pragmatic studies. Missing data in such studies reduce statistical power and precision, dilute randomization, compromise matching in matched-pair CRTs, and consequently increase bias. Statistical methods for the analysis of CRTs with missing data have been extensively investigated. Under appropriate missing data assumptions, methods based on the generalized estimating equation (GEE) or augmented GEE, multiple imputation, and inverse probability weighting can provide unbiased estimates of intervention effects [10, 11]. Conventional approaches to account for missing data in the sample size determination of CRTs are to divide the sample size by the expected follow-up rate or to use the expected average cluster size in the calculation. Unfortunately, such crude adjustment approaches may fail to incorporate the impact of missing data on sample size and power, which largely depends on missing magnitude, missing pattern, and intraclass correlations. Therefore, the estimated required sample size can be highly biased, when the cluster follow-up rate is highly variable and the intraclass correlation or cluster size is large [7, 12]. The generalized linear mixed model (GLMM) and GEE are two popular approaches for the design and analysis of CRTs. The GLMM approach uses random effects to account for correlation among individuals in the same cluster, and model parameters are estimated by the maximum likelihood or restricted maximum likelihood method [13]. However, the GLMM approach may be sensitive to model misspecifcation. The GEE approach is widely employed to estimate marginal intervention effects in clustered and longitudinal studies due to its robustness against misspecifed correlation structure and ability to accommodate missing data. In this chapter, we discuss sample size estimations for matched-pair cluster randomized design for pragmatic studies, based on the GEE approach. Design considerations for pragmatic matched-pair CRTs are discussed in Section 3.2. Sample size estimations for missing continuous and binary outcomes are presented in Sections 3.3 and 3.4, respectively. Specifcally, we focus on missing data among individuals within a cluster, as it is a more common problem. Withdrawal of entire clusters could be incorporated into the sample size calculation by the addition of one or two extra clusters per treatment group [7]. Section 3.5 includes further readings.

Matched-Pair Cluster Randomized Design for Pragmatic Studies

67

3.2 Considerations for Pragmatic Matched-Pair CRTs 3.2.1 Impact of Correlation Data from a matched-pair CRT present a complicated correlation structure. In the matched-pair CRT, matched clusters are randomized into two treatment groups: intervention and control. Let Yijk denote the outcome of study unit k in cluster j under treatment i, for i = 1, 2, j = 1,..., n, where j is a pair of matched clusters, and k = 1,..., m, where m is the cluster size. We consider a general correlation structure for the outcome data, including three intraclass correlations: (1) r1 = corr(Yijk , Yijk¢ ) for k ¹ k¢, the within-cluster correlation that describes the similarity between outcomes from different study units within the same cluster under the same treatment; (2) r2 = corr(Yijk , Yi¢jk ) for i ¹ i¢ , the within-matched individual (study unit) correlation that describes the similarity between outcomes from two matched individuals under different treatments; (3) r3 = corr(Yijk , Yi¢jk ¢ ) for i ¹ i¢ and k ¹ k¢, the within-matched cluster correlation that describes the similarity between outcomes from different study units within two matched clusters but under different treatments. The within-matched individual correlation r2 is required if matching occurs at the individual level. Whereas if matching only happens at the cluster level, the correlation structure would depend only on r1 and r3 , and the withinmatched individual correlation r2 would reduce to the within-matched cluster correlation r3 ( r2 = r3 ). Without loss of generality, we focus on a more complex correlation structure that involves three intraclass correlations ( r1 , r2 , r3 ). 3.2.2 Impact of Missing Data Missing data in CRTs can occur at both the level of the individual and the level of the cluster. In most cases, individuals within a cluster may have missing data due to loss to follow-ups, missed visits, or early dropouts, and thus observed cluster sizes would be varied. While occasionally, entire clusters may withdraw or not recruit any participant. Moreover, investigators often encounter similar withdrawal patterns within a cluster in the data collected. Therefore, data from pragmatic matched-pair CRTs present various missing magnitudes and missing patterns, which can be characterized by marginal and joint probabilities that an outcome measurement is observed. Let D ijk be an indicator variable, where D ijk = 1 if study unit k in cluster j has an outcome measurement under treatment i and 0 otherwise. Similar to the correlation structure for the outcome data, we defne correlations for the missing data mechanism: t 1 = corr(D ijk , D ijk ¢ ) for k ¹ k¢, t 2 = corr(D ijk , D i¢jk ) for i ¹ i¢ , and t 3 = corr(D ijk , D i¢jk ¢ ) for i ¹ i¢ and k ¹ k¢. If matching only occurs at the cluster level, t 2 would reduce to t 3 (t 2 = t 3 ). The impact of missing data on sample size and power in matched-pair CRTs depends on missing magnitude, missing pattern, and intraclass correlations for outcome data.

68

Design and Analysis of Pragmatic Trials

3.3 Matched-Pair Cluster Randomized Design with Missing Continuous Outcomes In this section, we discuss matched-pair cluster randomized design with continuous outcomes, in the presence of missing data. To tackle design issues in CRTs with missing data, Taljaard et al. [12] incorporated a design effect factor into addressing clustering and missing data in parallel CRTs and derived sample size formulas based on t-test or chi-square test. Roy et al. [14] used a linear mixed-effect regression model and an iterative method to examine sample size results at different follow-up rates in longitudinal cluster design. Mwandigha et al. [15] concluded that the power can be overestimated if right truncation is not accounted for with right-truncated Poisson distribution, and Li and Tong [16] developed sample size formulas for CRTs with count outcomes subject to right truncation, which can also accommodate missing outcomes. In this section, we present a closed-form sample size formula for matched-pair randomized design with missing continuous outcomes based on the GEE approach [17]. 3.3.1 Sample Size Estimation Based on GEE Approach Let X1jk = 1 denote the intervention for study unit k in the paired cluster j, and X 2 jk = 0 denote the control for study unit k in the paired cluster j. To make an inference on the intervention effect on the continuous outcome of study unit k in the paired cluster j under treatment i, Yijk , we assume a linear regression model, Yijk = b1 + b 2Xijk + e ijk ,

(3.1)

where b1 is the intercept, b 2 is the intervention effect, and e ijk is the random error with E(e ijk ) = 0 and Var(e ijk ) = s 2 . We discuss sample size estimation to test the null hypothesis H 0 : b 2 = 0 vs the alternative hypothesis H a : b 2 ¹ 0 with a power of 1 - g at a two-sided type I error of a . First, we consider the case with complete observations. Let Yj = ( Y1 j1 ,...Y1jm , Y2 j1 ,..., Y2 jm ) , T

a

2m ´1

vector

of

outcomes,

e j = ( e 1 j1 ,...e 1jm , e 2 j1 ,..., e 2 jm ) , a 2m ´1 vector of random errors, X j be a 2m ´ 2 design matrix, T

æ 1m Xj = ç è 1m

0m ö ÷, 1m ø

where 1m is an m ´1 vector of 1’s and 0 m is an m ´1 vector of 0’s and b = ( b1 , b 2 ) . Model (3.1) can be rewritten as Yj = X j b + e j . Under the model assumption, the true correlation matrix is T

Matched-Pair Cluster Randomized Design for Pragmatic Studies

æ (1 - r1 )Im + r1 Jm Rc = ç è ( r2 - r3 )Im + r3 Jm

69

( r2 - r3 )Im + r3 Jm ö ÷, (1 - r1 )Im + r1 Jm ø

where I is an identity matrix and J is a square matrix of 1’s. In the Appendix, we show that the correlation matrix Rc has four distinct eigenvalues,

l1 = 1 - r1 + r2 - r3 , l2 = 1 + (m - 1)r1 + r2 + (m - 1)r3 , l3 = 1 - r1 - r2 + r3 , l4 = 1 + (m - 1)r1 - r2 - (m - 1)r3 . The setting of ( r1 , r2 , r3 ) should satisfy min(l1 , l2 , l3 , l4 ) > 0 so that the correlation matrix Rc is positive defnite. Under the independent working corT relation structure, the GEE estimator bˆ = bˆ1 , bˆ2 is obtained by solving the

(

equation:

Sn ( b ) = n-1/2

)

å(Y - X b ) X = 0. j

T

j

j

j

The expression of bˆ is

bˆ =

æ Y jXj ç ç è

å j

T

-1

ö X Xj ÷ , ÷ j ø as bˆ1 = (nm)-1

å

which can -1 ˆ b 2 = (nm)

T j

åå

be simplifed Y1jk and j k (Y2 jk - Y1jk ). j k By Liang and Zeger [18], n-1/2 ( bˆ - b ) is approximately normal with mean 0 and the variance being consistently estimated by the sandwich variance estimator å n = An-1Vn An-1 , where

åå

An = n-1

åX X , T j

j

j

æ Vn = s 2 ç n-1 ç è

ö X jT Rc X j ÷ . ÷ ø

å j

The (2, 2)th element of å n is the robust variance of n1/2 bˆ2 , denoted as s 22 . We reject H 0 : b 2 = 0 if|n1/2 bˆ2 / s 2 |> z1-a /2 , where z1-a /2 is the 100(1 - a / 2)th percentile of the standard normal distribution. Given type I error a , power 1 - g , and the true value of the intervention effect b 2 , we can solve the sample size n from the equation æ |bˆ2 | ö Pç > z1-a/2 |H a ÷ = 1 - g. ç sˆ 2 / n ÷ è ø

70

Design and Analysis of Pragmatic Trials

Thus, the required total number of clusters per group with complete observations is ncomplete =

s 22 (z1-a/2 + z1-g )2 b 22

.

Next, we extend the GEE sample size approach to accommodate the potential occurrence of missing data. This extension requires the assumption of missing completely at random (MCAR) [19]. That is, the missing events are independent of both the observed or unobserved outcomes. Under MCAR, (D1jk , D 2 jk ) are independent of (Y1jk , Y2 jk ). In the presence of missing data, X j , Yj , e j can be rewritten as æ D1 j1 ç ç ˜ ç D1jm Xj = ç ç D 2 j1 ç ˜ ç ç D 2 jm è

0 ö æ Y1 j1D1 j1 ö æ e1 j1D1 j1 ö ÷ ç ÷ ç ÷ ˜ ÷ ˜ ˜ ç ÷ ç ÷ ç Y1 jm D1jm ÷ ç e1jm D1jm ÷ 0 ÷ ÷ , Yj = ç ÷ , ej = ç ÷. D 2 j1 ÷ ç Y2 j1D 2 j1 ÷ ç e 2 j1D 2 j1 ÷ ç ÷ ç ÷ ˜ ÷ ˜ ˜ ÷ ç ÷ ç ÷ ÷ ç ÷ ç ÷ D 2 jm ø è Y2 jm D 2 jm ø è e 2 jm D 2 jm ø

Assuming that the study units under the same treatment share the same missing proportion: E(D1jk = 1) = s1 and E(D 2 jk = 1) = s2 , we have E(D ijk D ijk ¢ ) = si2 + t 1siti , E(D ijk D i¢jk ) = s12 = s1s2 + t 2 s1t1s2t2 , E(D ijk D i¢jk ¢ ) = s1s2 + t 3 s1t1s2t2 , and E(D ijl D ijl ) = E(D ijl ), where ti = 1 - si . The missing correlation matrix can be written as æ (1 - t 1 )Im + t 1 Jm Rm = ç è (t 2 - t 3 )Im + t 3 Jm

(t 2 - t 3 )Im + t 3 Jm ö ÷, (1 - t 1 )Im + t 1 Jm ø

which has four distinct eigenvalues,

l1’ = 1 - t 1 + t 2 - t 3 , l2’ = 1 + (m - 1)t 1 + t 2 + (m - 1)t 3 , l3’ = 1 - t 1 - t 2 + t 3 , l4’ = 1 + (m - 1)t 1 - t 2 - (m - 1)t 3 . Similar to the correlations of outcome data, the setting of (t 1 , t 2 , t 3 ) should satisfy min(l1’, l2’, l3’, l4’) > 0 so that the correlation matrix Rm is positive defnite. As n ® ¥, æ 1 ç ms 1 n®¥ An-1 ¾¾¾ ®ç ç- 1 ç è ms1

ö ÷ æ V11 n®¥ ÷ , Vn ¾¾¾ ®s 2 ç 1 1 ÷ è V21 + ms1 ms2 ø÷ -

1 ms1

V12 ö ÷. V22 ø

71

Matched-Pair Cluster Randomized Design for Pragmatic Studies

where V11 = [s1 + s2 + r1(m - 1)(s12 + s22 + t1s1t1 + t1s2t2 ) + 2 r2 (s1s2 + t2 s1t1s2t2 ) +2 r3 (m - 1)(s1s2 + t3 s1t1s2t2 )]m, V12 = V21 = [s2 + r1(m - 1)(s22 + t1s2t2 ) + r2 (s1s2 + t2 s1t1s2t2 ) + r3 (m - 1)(s1s2 + t3 s1t1s2t2 )]m, V22 = ëé s2 + r1(m - 1)(s22 + t1s2t2 )ùû m. The (2, 2)th element of å n = An-1Vn An-1 converges to

s 22 =

+

s2 é 1 1 + - 2 r2 + 2(m - 1)( r1 - r3 ) m êë s1 s2

r1(m - 1)t1(s1t2 + s2t1 ) - 2[ r2t2 + r3 (m - 1)t3 ] s1t1s2t2 ù ú. s1s2 û

Thus, the required total number of clusters per group that accounts for potential incomplete observations is nGEE =

+

s 2 (z1-a/2 + z1-g )2 mb 22

é1 1 ê s + s - 2 r2 + 2(m - 1)( r1 - r3 ) 2 ë 1

r1(m - 1)t1(s1t2 + s2t1 ) - 2[ r2t2 + r3 (m - 1)t3 ] s1t1s2t2 s1s2

ù ú. û

(3.2)

The estimated total number of clusters per group n depends on the missing magnitude and pattern through parameters (s1 , s2 , t1 , t2 , t3 ), and intraclass correlations ( r1 , r2 , r3 ), in addition to the variance of error terms s 2 , the true intervention effect b 2 , the cluster size m, and type I and type II errors. A stronger within-subject correlation r2 or inter-treatment correlation r3 would lead to a smaller sample size, while a stronger within-treatment correlation r1 would lead to a larger sample size, with all the other factors fxed. The missing correlations have similar effects on the required number of clusters in that a stronger within-subject t2 or inter-treatment t3 (within-treatment t1) would lead to a smaller (larger) number of clusters. The closed-form sample size formula (3.2) is derived under general correlation structures for outcome data and missing data mechanisms. When matching only occurs at the cluster level, we can set r2 = r3 and t2 = t3 .

72

Design and Analysis of Pragmatic Trials

Some special cases: with complete observations (s1 = s2 = 1), the required total number of clusters per group is ncomplete =

s 2 (z1-a/2 + z1-g )2 [ 2 - 2 r2 + 2(m - 1)( r1 - r3 )] . mb 22

Under independent missing (t1 = t2 = t3 = 0), the sample size formula (3.2) reduces to * nGEE =

s 2 (z1-a/2 + z1-g )2 é 1 1 ù + - 2 r2 + 2(m - 1)( r1 - r3 )ú . 2 ê mb 2 ë s1 s2 û

When there is no missing data, within-subject correlation or inter-treatment correlation (s1 = s2 = 1, r2 = r3 = 0 ), the sample size formula becomes ncomplete =

2s 2 ( z1- a/2 + z1- g )2 {1 + (m - 1)r1 } , mb 22

which is the required number of clusters per group for parallel CRTs [20] and r1 represents the conventional intracluster correlation coeffcient (ICC) in such trials. 3.3.2 Relative Efficiency of GEE Approach vs Crude Adjustment In matched-pair CRTs, one common ad hoc approach to handling missing data is to use a crude adjustment. This approach is to divide the sample size estimated with complete observations ncomplete by q, where q is the proportion of study units with complete pairs of observations under individuallevel matching or the average cluster follow-up rate under cluster-level matching. First, we consider the individual-level matching-paired CRTs with more general correlation structures and compare the GEE approach with the crude adjustment. Under general missing and independent missing, q = s1s2 + t2 s1t1s2t2 and q* = s1s2 , respectively. Let R = nGEE / ncrude , the sample size ratio comparing the GEE approach with the crude adjustment under general missing, and R = ì1 1 r (m - 1)t1(s1t2 + s2t1 ) - 2[ r2t2 + r3 (m - 1)t3 ] s1t1s2t2 + - 2 r2 + 2(m - 1)( r1 - r3 ) + 1 ïï s1 s2 s1s2 í r r r 2 2 + 2(m 1)( ) 2 1 3 ï ïî

´(s1s2 + t2 s1t1s2t2 ).

ü ïï ý ï ïþ

Matched-Pair Cluster Randomized Design for Pragmatic Studies

73

* * Let R* = nGEE / ncrude , the sample size ratio comparing the GEE approach with the crude adjustment under independent missing, and

1 1 + - 2 r2 + 2(m - 1)( r1 - r3 ) ´ s1s2 . R* = s1 s2 2 - 2 r2 + 2(m - 1)( r1 - r3 ) The relative effciency of GEE approach vs crude adjustment can be measured by the sample size ratio. Xu et al. [17] numerically investigated how the relative effciency is affected by different types of intraclass correlation of outcome data. Figure 3.1 presents the plot of sample size ratio R(R* ) against the within-subject correlation r2 under individual-level matching. The fgure shows that the sample size ratio R(R* ) increases as the within-subject correlation r2 increases, with other design parameters fxed. Under independent missing, when r2 < r * = (s1 + s2 - 2)/ 2(s1s2 - 1) + (m - 1)( r1 - r3 ), the crude adjustment would lead to a larger sample size comparing with the GEE approach, whereas when r2 > r * , the crude adjustment would lead to a smaller sample size. A similar pattern is found under general missing. The cutoff point r ( r * ) for each combination is marked in Figure 3.1. With missing data, the GEE approach would lead to a more accurate sample size estimate comparing with the crude adjustment. Figure 3.2 presents the plot of sample size ratio R(R* ) against the within-treatment correlation r1 under individuallevel matching. It shows that the crude adjustment gives a larger sample size, and the sample size ratio R(R* ) decreases as within-treatment correlation r1 increases, with other design parameters being fxed. Next, we consider the cluster-level matched-paired CRTs where there are only the within-treatment correlation r1 and inter-treatment correlation r3 . The GEE sample size formula under cluster-level matching can be obtained by setting r2 = r3 and t2 = t3 in formula (2) under individuallevel matching. For the crude adjustment under cluster-level matching, q = q* = (s1 + s2 )/ 2. We have numerically examined the impact of intraclass correlation of outcome data on the relative effciency. Figure 3.3 presents the plot of sample size ratio R(R* ) against within-treatment correlation r1, under cluster-level matching and fxing the inter-treatment correlation at r3 = 0.015 . With other design parameters fxed, the sample size ratio R(R* ) decreases as the within-treatment correlation r1 increases, similar to Figure 3.2. The within-treatment correlation here represents ICC in conventional parallel CRTs, and for most CRTs ICCs are small (0.001 to 0.2) [21, 22]. When the within-treatment correlation is small-to-moderate as in most CRTs, the fgure suggests that the GEE approach would likely lead to a sample size saving (R or R* < 1).

74

Design and Analysis of Pragmatic Trials

FIGURE 3.1 A numerical study to explore the impact of within-subject correlation r2 on relative effciency with individual-level matching, under general missing (t1 = 0.3, t2 = 0.1, t3 = 0 ) and independent missing, for m = 10, different combinations of (s1 , s2 ), and ( r1 , r3 ) = (0.01, 0.005).

3.3.3 Adjustment of Inflated Type I Error The GEE sample size approach is developed based on the asymptotic properties of estimated parameters. In large samples, for example n > 40 independent clusters in CRTs, the sandwich variance estimator provides valid inference regardless of the correct specifcation of the correlation matrix for the outcome data. While the sandwich variance estimator tends to be biased downwards and may infate the type I error when the sample size is small,

Matched-Pair Cluster Randomized Design for Pragmatic Studies

75

FIGURE 3.2 A numerical study to explore the impact of within-treatment correlation r1 on relative effciency with individual-level matching, under general missing (t1 = 0.3, t2 = 0.1, t3 = 0 ) and independent missing, for m = 10, different combinations of (s1 , s2 ), and ( r2 , r3 ) = (0.5, 0.005) .

for example n < 40 in CRTs [21, 23]. Various bias-corrected variance estimators were proposed to address this issue [24]. In a numerical study, we have illustrated how such bias can be alleviated by using bias-corrected variance estimators, when the number of clusters is small. Figure 3.4 shows the corresponding empirical type I errors by using different variance estimators, for the scenario of ( r1 , r2 , r3 ) = (0.01, 0.3, 0.005) and m = 20 under general and independent missing patterns. In the fgure, LZ represents the uncorrected sandwich estimator of Liang and Zeger [18], KC and MD represent the

76

Design and Analysis of Pragmatic Trials

FIGURE 3.3 A numerical study to explore the impact of within-treatment correlation r1 on relative effciency with cluster-level matching, under general missing (t1 = 0.3, t2 = t3 = 0.1) and independent missing, for m = 10, different combinations of (s1 , s2 ), and r3 = 0.015.

bias-corrected variance estimators proposed by Kauermann and Carroll [25] and Mancl and DeRouen [26], respectively. With the same generated dataset, the variances of these methods are LZ < KC < MD [24], thus KC tends to give a moderate adjustment and MD provides more conservative results. The empirical type I error based on KC is lower than that based on LZ but is still slightly higher than the nominal level, while MD has a better performance in controlling the type I error. The difference in type I errors among these three methods becomes smaller, with an increased number of clusters.

Matched-Pair Cluster Randomized Design for Pragmatic Studies

77

3.3.4 Sensitivity Analysis The sample size estimate and statistical power depend on missing pattern, missing proportion, and intraclass correlations. In absence of such information, a sample size re-estimation in the middle of the trial is recommended based on observed data or a sensitivity analysis should be conducted with a range of design parameter values. In numerical studies, we have conducted sensitivity analyses to investigate how the misspecifed missing pattern and correlation structure of outcome data affect the empirical power and required number of clusters per group. For misspecifed missing pattern, the required number of clusters per group is estimated using the formula under independent missing, while the data are generated under general missing with missing correlations (t1 , t2 , t3 ) = (0.3, 0.1, 0) or (0.5, 0.2, 0) and non-missing proportions (s1 , s2 ) = (0.85, 0.85) or (s1 , s2 ) = (0.5, 0.6). Figure 3.5 represents a plot of the empirical power under misspecifed missing patterns and true missing patterns for different scenarios by the GEE approach. Figure 3.5 shows that with increased missing proportion, the effect of misspecifed missing pattern becomes bigger, especially for the cluster size m = 20, and bigger missing correlations would also enlarge the effect. For misspecifed correlation structure of outcome data, the sample size is estimated with intraclass correlations of outcome data ( r1 , r2 , r3 ) = (0.01, 0.005, 0.005), where the data are generated with ( r1 , r2 , r3 ) = (0.05, 0.025, 0.025) or (0.03,0.015,0.015), under both independent missing and general missing (t1 , t2 , t3 ) = (0.3, 0.1, 0.1) with (s1, s2) =(0.85,0.85). Figure 3.6 represents a plot of the empirical power.

FIGURE 3.4 Empirical type I errors by using three variance estimators. KC, KC-corrected sandwich variance; LZ, uncorrected sandwich variance; MD, MD-corrected sandwich variance.

78

Design and Analysis of Pragmatic Trials

FIGURE 3.5 Empirical power under misspecifed missing pattern and true missing pattern by the GEE approach.

It shows that the empirical power is lower than the nominal level with misspecifed correlation structures, and the impact of misspecifed correlation structure becomes bigger with stronger correlations. These results suggest that the missing pattern, missing proportion, and correlation structure of outcome data substantially infuence the sample size estimate and power. 3.3.5 Example A matched-pair CRT of a school-based health promotion intervention for improving physical ftness in adolescents was reported by Andrade et al. [27] In this trial, ten pairs of schools were enrolled with an average school size of 72, and the schools were randomly assigned to either the intervention group or the control group. The students’ withdrawal/missing rates were 21.4% and 28.0% for intervention group and control group, respectively. One primary outcome was the speed shuttle run time, and the average (standard deviation) time was 1.89 (2.09) and 2.69 (3.44) seconds for the intervention group and control group, respectively. The intervention effect and the variance of random error were estimated as bˆ2 = -0.72 and sˆ 2 = 5, respectively. The within-treatment correlation was estimated as rˆ1 = 0.15. Suppose that investigators would like to conduct a similar matched-pair CRT to detect whether there is a signifcant difference in the speed shuttle run time under

Matched-Pair Cluster Randomized Design for Pragmatic Studies

79

FIGURE 3.6 Empirical power under misspecifed correlation structure.

a new intervention comparing with a control. Based on the preliminary data, we specify the design factors as m = 72, b 2 = -0.72, s 2 = 5, r1 = 0.15, and nonmissing proportions s1 = 78.6% and s2 = 72.0%. The inter-treatment correlation was not reported in the previous trial, thus as a common choice, we assume the inter-treatment correlation as half of the within-treatment correlation, r3 = 0.075. Also, since matching would be at the cluster/school level, we set r2 = r3 = 0.075 and t2 = t3 . We consider two missing patterns: general

80

Design and Analysis of Pragmatic Trials

missing with (t1 , t3 ) = (0.3, 0.1) and independent missing. The required number of clusters per group by the GEE approach is 16 to achieve at least 80% power to detect an intervention effect of b 2 = -0.72, at a two-sided 5% signifcance level, under general missing with (t1 , t3 ) = (0.3, 0.1). The required number of clusters per group decreases to 14 under independent missing. Moreover, the number of clusters per group by the crude adjustment is 18. The results suggest that the GEE approach would lead to a saving in sample size compared with the crude adjustment.

3.4 Matched-Pair Cluster Randomized Design with Missing Binary Outcomes In this section, we discuss matched-pair cluster randomized design with missing binary outcomes and present a closed-form sample size formula based on the GEE approach [28]. 3.4.1 Sample Size Estimation Based on GEE Approach Let Yijk denote the binary outcome of study unit k in cluster j under treatment i, for i = 1, 2, j = 1,..., n, where j is a pair of matched clusters, and k = 1,..., m, where m is the cluster size. Let X1jk = 0 denote the control for study unit k in cluster j, and X 2 jk = 1 denote the intervention for study unit k in cluster j. To make an inference on the intervention effect on Yijk , we assume the following logistic regression model: Yijk  Bernoulli (pijk ) and æ pijk ö logit{Pr(Yijk = 1)} = log çç (3.3) ÷÷ = b1 + b 2Xijk , è 1 - pijk ø where b1 is the log-transformed odds for the control group, and b 2 is the log-transformed odds ratio between the intervention and control groups, representing the intervention effect on the outcome. The interest is to test the null hypothesis H 0 : b 2 = 0, accounting for the missing patterns and the correlation structure, with a power of 1 - g at a two-sided signifcance level of a . Let b = ( b1 , b 2 )T and Xijk = (1, Xijk )T . Then, model (3.3) can be written as pijk =

e

b Xijk

1+ e

b Xijk

.

Let Yj = ( Y1 j1 ,...Y1jm , Y2 j1 ,..., Y2 jm ) , a 2m ´1 vector of outcomes. The binary outcome data have the true correlation matrix as follows: T

81

Matched-Pair Cluster Randomized Design for Pragmatic Studies

æ (1 - r1 )Im + r1 Jm Rc = ç è ( r2 - r3 )Im + r3 Jm

( r2 - r3 )Im + r3 Jm ö ÷, (1 - r1 )Im + r1 Jm ø

where I is an identity matrix and J is a square matrix of 1’s. This correlation matrix is the same as the one for the continuous outcome data. As described in Section 3.1, the setting of ( r1 , r2 , r3 ) should satisfy min(l1 , l2 , l3 , l4 ) > 0 so that the correlation matrix Rc is positive defnite, where (l1 , l2 , l3 , l4 ) are the four distinct eigenvalues of Rc . First, we discuss the case with complete observations. Under the independent working correlation structure, the GEE estimator bˆ is obtained by solving the equation:

2

n

Sn ( b ) = n-1/2

m

ååå{Y

ijk

- pijk ( b )} Xijk = 0.

j=1 i=1 k=1

The equation is solved by the Newton–Raphson algorithm. At the tth iteration,

(

) (

)

bˆ (t ) = bˆ (t-1) + n-1/2 An-1 bˆ (t-1) Sn bˆ (t-1) , where An ( b ) = -n

-

1 2

¶Sn ( b ) = n-1 ¶b

2

n

m

åååp

ijk

j=1 i=1 k=1

æ 1 (1 - pijk ) ç è Xijkk

Xijk ö 2 ÷. Xijk ø

By Liang and Zeger [18], n-1/2 ( bˆ - b ) is approximately normal with mean 0 and the variance is consistently estimated by å n = An-1 bˆ Vn bˆ An-1 bˆ , where

( ) ( ) ( )

( )

Vn bˆ = n-1

n

æ ç ç è

2

ö eˆ ijk Xijk ÷ ÷ k=1 ø m

å åå j=1

i=1

( )

Ä2

,

eˆijk = Yijk - pijk bˆ , and cÄ2 = ccT for a vector c. The (2, 2)th element of å n is the robust variance of n1/2 bˆ2 , denoted as sˆ 22 . We reject H 0 : b 2 = 0 if |n1/2 bˆ2 / sˆ 2 |> z1-a /2 , where z1-a /2 is the 100(1 - a / 2)th percentile of the standard normal distribution. Given type I error a , power 1 - g , and the true value of the intervention effect b 2 , the sample size n can be obtained by solving the equation - n b 2 / s 2 + z1-a /2 = -z1-g . Thus, the required total number of clusters per group with complete observations is ncomplete =

s 22 (z1-a/2 + z1-g )2 . b 22

82

Design and Analysis of Pragmatic Trials

Next, we incorporate missing data into the GEE sample size estimation. The sample size estimation is derived under MCAR: the missing events are independent of both observed and unobserved outcomes. Let D ijk be an indicator variable, where D ijk = 1 if study unit k in cluster j has an outcome measurement under treatment i and 0 otherwise. We assume that the study units under the same treatment share the same missing proportion: E(D1jk = 1) = s1 and E(D 2 jk = 1) = s2 . Let ti = 1 - si , i = 1, 2. The missing correlation matrix Rm for the binary outcome data is the same as the one for the continuous outcome data, and the setting of (t1 , t2 , t3 ) should satisfy min(l1’, l 2’, l 3’, l 4’) > 0 so that the correlation matrix Rm is positive defnite, where (l1’, l 2’, l 3’, l 4’) are the four distinct eigenvalues of Rm . With missing data, A n ( b ) and Vn bˆ become

( )

An ( b ) =

1 n

n

2

m

åååD

ijk

j=1 i=1 k=1

æ 1 × pijk ( b ) ( 1 - pijk ( b ) ) ç è Xijk

Xijk ö 2 ÷ Xijk kø

and

( )

1 Vn bˆ = n

n

æ ç ç è

2

ö D ijk eˆ ijk Xijk ÷ ÷ k=1 ø m

å åå j=1

i=1

Ä2

,

respectively. Denote the true success rates under the control and under the intervention as P1 = e b1 /(1 + e b1 ) and P2 = e b1 + b2 /(1 + e b1 + b2 ), respectively. Let Q1 = 1 - P1 and Q2 = 1 - P2 . The null hypothesis H 0 : b 2 = 0 is equivalent to H 0 : P1 = P2 . Theorem 1 As n ® ¥, å n = An-1 bˆ Vn bˆ An-1 bˆ ® å, and the (2,2)th element

( ) ( ) ( )

of å has a closed form

s 2*2 =

-

1 + (m - 1)(s1 + t1t1 )r1 1 + (m - 1)(s2 + t1t2 )r1 + ms1P1Q1 ms2P2Q2

2 P1P2Q1Q2 éë(s1s2 + t2 s1s2t1t2 )r2 + (m - 1)(s1s2 + t3 s1s2t1t2 )r3 ùû . ms1s2P1P2Q1Q2

The proof of Theorem 1 is presented in the Appendix. Given type I error a, power 1 - g, and the true value of the intervention effect b 2 , the required total number of clusters per group that accounts for potential incomplete observations is given by nGEE =

s 2*2 (z1-a/2 + z1-g )2 . b 22

Matched-Pair Cluster Randomized Design for Pragmatic Studies

83

The GEE sample size formula under cluster-level matching can be obtained by setting r2 = r3 and t2 = t3 in the above formula under individual-level matching. The estimated total number of clusters per group n depends on the missing magnitude and pattern through parameters (s1 , s2 , t1 , t2 , t3 ), the correlation combination ( r1 , r2 , r3 ), the cluster size m, success rates under the control and under the intervention (P1 , P2 ), and type I and type II errors. With all the other factors fxed, a higher within-subject correlation r2 or intertreatment correlation r3 would lead to a smaller sample size, while a higher within-treatment correlation r1 would lead to a larger sample size. The missing correlations have similar effects on the required number of clusters in that a higher within-subject t2 or inter-treatment t3 (within-treatment t1) correlation would lead to a smaller (larger) number of clusters. Some special cases: with complete observations (s1 = s2 = 1), the required total number of clusters per group can be estimated by

ncomplete =

1 r3 ] P1P2Q1Q2 ( z1-a/2 + z1-g )2 [1 + (m - 1)r1 ] (P1Q1 + P2Q2 ) - 2 [ r2 + (m - 1) ´ . 2 b2 mP1Q1P2Q2

Under independent missing where t1 = t2 = t3 = 0, the sample size is * nGEE =

( z1-a/2 + z1-g )2 1 + (m - 1)s1 r1 1 + (m - 1)s2 r1 2 [ r2 + (m - 1)r3 ] ´{ + }. ms1P1Q1 ms2P2Q2 b 22 m P1P2Q1Q2

3.4.2 Relative Efficiency of GEE Approach vs Crude Adjustment To compare the GEE approach with the crude adjustment, we seek to fnd the relative effciency measured by the sample size ratio R = nGEE / ncrude , similar to that for the continuous outcomes in Section 3.2. Let q be the proportion of study units with complete pairs of observations under individual-level matching or the average cluster follow-up rate under cluster-level matching. First, we consider the individual-level matching-paired CRTs with more general correlation structures. Under the general missing pattern, q = s1s2 + t2 s1t1s2t2 and the sample size ratio R is R=

é s1 + (m - 1)(s12 + t1s1t1 )r1 ùû P2Q2 éë s2 + (m - 1)(s22 + t1s2t2 )r1 ùû P1Q1 {ë + s12 s22 -

2 P1P2Q1Q2 ëé(s1s2 + t2 s1s2t1t2 ) + (m - 1)(s1s2 + t3 s1s2t1t2 )r3 ùû } s1s2

´

s1s2 + t2 s1s2t1t2 . Q P2Q2 ) - 2 [ r2 + (m - 1)r3 ] P1P2Q1Q2 1 (m 1) r (P + + [ 1] 1 1

84

Design and Analysis of Pragmatic Trials

Under independent missing pattern where t1 = t2 = t3 = 0, q* = s1s2 and the * * sample size ratio R* = nGEE / ncrude is

R* =

{

[1 + (m - 1)s1r1 ] P2Q2 + [1 + (m - 1)s2 r1 ] P1Q1 - 2

´

s1

s2

P1P2Q1Q2 [ r2 + (m - 1)r3 ]}

s1s2

[1 + (m - 1)r1 ] (P1Q1 + P2Q2 ) - 2 [ r2 + (m - 1)r3 ]

P1P2Q1Q2

.

Xu et al. [28] numerically examined the infuence of different types of intraclass correlations of outcome data on the relative effciency by plotting the sample size ratio against the correlations with a fxed cluster size m = 10 and success rates under the control and intervention (P1 , P2 ) = (0.1, 0.15). Figure 3.7 presents the plot of sample size ratio R(R* ) against within-subject correlation r2 , under individual-level matching. The fgure shows that the sample size ratio R(R* ) increases as the within-subject correlation r2 increases, with other design parameters fxed. Under both missing patterns, the GEE approach would lead to a smaller sample size compared with the crude adjustment, when r2 is less than a cutoff point, whereas when r2 is greater than the cutoff point, the crude adjustment would give a smaller sample size. The values of cutoff point r2 ( r2* ) for each combination are marked in Figure 3.7. Figure 3.8 presents the plot of the sample size ratio R(R* ) against within-treatment correlation r1, under cluster-level matching and fxing the inter-treatment correlation at r3 = 0.005. The sample size ratio R(R* ) decreases as the withintreatment correlation r1 increases, with other design parameters being fxed. 3.4.3 Adjustment of Inflated Type I Error and Sensitivity Analysis Bias-corrected variance estimators can be used to adjust for infated type I error, when the sample size is small, for example n < 40 clusters in CRTs. Figure 3.9 shows empirical type I errors with different variance estimators, for ( r1 , r2 , r3 ) = (0.03, 0.3, 0.005) and m = 20 under both missing patterns. LZ represents the unadjusted sandwich variance estimator by Liang and Zeger [18], KC represents the bias-corrected variance estimator by Kauermann and Carroll [25], and MD represents that by Mancl and DeRouen [26]. This fgure shows that the empirical type I error based on KC is lower than that based on LZ but still slightly higher than the nominal level, while that based on MD is closer to 5%. The impact of misspecifed missing pattern on the required number of clusters per group and empirical power can be assessed by a sensitivity analysis. For example, in a numerical study, the required number of clusters per group is estimated using the GEE formula under independent missing, while the data are generated under general missing

Matched-Pair Cluster Randomized Design for Pragmatic Studies

85

FIGURE 3.7 The plot of sample size ratio against within-subject correlation r2 with individual-level matching, under general missing (t1 = 0.3, t2 = 0.1, t3 = 0) and independent missing, for different combinations of (s1 , s2 ) and ( r1 , r3 ) = (0.03, 0.005) . The term “within-matched individual” here is the same as “within-subject”.

with missing correlations (t1 , t2 , t3 ) = (0.3, 0.1, 0) or (0.5, 0.2, 0) and non-missing proportions (s1 , s2 ) = (0.85, 0.85) or (s1 , s2 ) = (0.5, 0.6). Figure 3.10 presents the empirical power under misspecifed missing patterns and true missing pattern, for different scenarios by the GEE approach. It suggests that (i) the impact of misspecifed missing pattern becomes bigger as missing proportion increases, particularly for the cluster size m = 20, and (ii) the larger missing correlation would lead to a bigger bias.

86

Design and Analysis of Pragmatic Trials

A

Sample Size Ratio ˜R°

1.75 Combination 1: s1 = 0.8, s2 = 0.9 Combination 2: s1 = 0.7, s2 = 0.8 Combination 3: s1 = 0.6, s2 = 0.7 Combination 4: s1 = 0.5, s2 = 0.6

1.50

1.25

R=1

1.00 1=0.015

0.75

1=0.016 1=0.019 1=0.021

0.0

Sample Size Ratio ˜R*°

B

0.1

0.2

0.3

Independent Missing

1.75

Combination 1: s1 = 0.8, s2 = 0.9 Combination 2: s1 = 0.7, s2 = 0.8 Combination 3: s1 = 0.6, s2 = 0.7 Combination 4: s1 = 0.5, s2 = 0.6

1.50

1.25

R*=1

1.00 * 1=0.011 * 1=0.013 * 1=0.014 * 1=0.015

0.75

0.0

0.1

0.2

0.3

FIGURE 3.8 The plot of sample size ratio against within-treatment correlation ρ1 with cluster-level matching, under general missing (τ1 = 0.3, τ2 = τ3 = 0.1) and independent missing, for different combinations of (s1 , s2 ) and ρ3 = 0.005. The term “within-cluster” here is the same as “within-treatment”.

3.4.4 Example The Community Hypertension Assessment Trial (CHAT) investigated the effect of community pharmacy-based blood pressure (BP) monitoring sessions led by peer health educators, with feedback to family physicians (Intervention) against the usual practice model (Control), on the monitoring and management of BP among older adults [29]. The individual family practice (FP) clinic served as the unit of randomization with 14 matched pairs of

87

Matched-Pair Cluster Randomized Design for Pragmatic Studies

Empirical type I error

8%

7%

6%

5% LZ KC MD

4% 20

30

40

50

60

70

80

Number of Clusters

FIGURE 3.9 Empirical type 1 errors by using three variance estimators. KC, KC-corrected sandwich variance; LZ, uncorrected sandwich variance; MD, MD-corrected sandwich variance.

(a)

(b)

m = 10 , ( s1 , s2 ) = ( 0.85 , 0.85 )

85%

85%

80%

80%

75%

75%

True missing Misspecified missing ( Misspecified missing (

70% 40

60

80

1 1

, ,

2 2

, ,

3 3

100

) = ( 0.3 , 0.1 , 0 ) ) = ( 0.5 , 0.2 , 0 )

m = 20 , ( s1 , s2 ) = ( 0.85 , 0.85 )

True missing Misspecified missing ( Misspecified missing (

70% 20

120

40

m = 10 , ( s1 , s2 ) = ( 0.5 , 0.6 ) 85%

80%

80%

75%

75%

True missing Misspecified missing ( Misspecified missing (

85

105

Number of Clusters

, ,

2 2

, ,

3 3

80

60

) = ( 0.3 , 0.1 , 0 ) ) = ( 0.5 , 0.2 , 0 )

100

m = 20 , ( s1 , s2 ) = ( 0.5 , 0.6 )

85%

65

1

Number of Clusters

Number of Clusters

70%

1

125

1 1

, ,

2 2

, ,

3 3

) = ( 0.3 , 0.1 , 0 ) ) = ( 0.5 , 0.2 , 0 )

145

True missing Misspecified missing ( Misspecified missing (

70% 20

40

60

80

1 1

, ,

2 2

, ,

3 3

) = ( 0.3 , 0.1 , 0 ) ) = ( 0.5 , 0.2 , 0 )

100

Number of Clusters

FIGURE 3.10 Empirical power under misspecifed missing pattern and true missing pattern by the GEE approach.

88

Design and Analysis of Pragmatic Trials

control and intervention clinics composing the fnal sample. Fifty-fve eligible patients were randomly selected from each FP clinic. The primary outcome of the CHAT study was a binary outcome with “1” indicating the patient’s BP was controlled at the end of the trial and “0” otherwise. This matchedpair CRT yielded an estimated ICC (within-treatment correlation r1) of 0.077. Suppose that investigators would like to conduct a similar community intervention trial to examine the effect of a team-based approach including pharmacists (Intervention) against the usual practice model (Control) in controlling the patient’s BP. The trial will use a matched-pair cluster randomized design with FP as the unit of randomization. Fifty eligible patients (m = 50) will be randomly selected from each FP clinic. Based on the preliminary data from CHAT, we assume the within-treatment correlation r1 to be 0.077. Since the inter-treatment correlation was not reported in CHAT, as a common choice, we assume the inter-treatment correlation as half of the withintreatment correlation, r3 = 0.038. Matching would occur at the cluster/clinic level, thus we set r2 = r3 and t 2 = t 3 . Two missing patterns will be considered: independent missing and general missing. We assume the BP control rate of 83.4% at baseline in both intervention and usual care groups. We expect that the BP control rate will not change in the usual care group and will improve to 95.0% in the intervention group at the end of the trial. To detect a difference of P2 - P1 = 95.0% - 83.4% = 11.6% at the end of the trial with 90% power and 5% type I error, the required numbers of FP per group are 10, 11, and 12, under independent missing with the missing proportions of 0%, 20%, and 30%, respectively. The required numbers of FP per group are 10, 12, and 13, under general missing (t1 = 0.3, t3 = 0.1) with the missing proportions of 0%, 20%, and 30%, respectively. Moreover, the numbers of FP per group estimated by the crude adjustment are 13 and 15, with the missing proportions of 20% and 30%, respectively. The result suggests that the required sample size depends on the missing pattern and missing proportion, and the GEE approach would lead to a sample size saving as compared with the crude adjustment.

3.5 Further Readings In this chapter, we have discussed sample size estimations for matchedpair cluster randomized design for pragmatic studies. Various correlation structures and missing data are frequently encountered in pragmatic matched-pair CRTs. Sample size estimation must take into account the impact of correlation and missing data. We present closed-form sample size formulas for matched-pair cluster randomized design with continuous and binary outcomes, based on the GEE approach by treating incomplete observations as missing data in marginal (generalized) linear models. Different types of intraclass correlation, missing patterns, and missing proportions can be

Matched-Pair Cluster Randomized Design for Pragmatic Studies

89

accounted for with appropriate settings of design parameters to estimate the required number of clusters per group. In the presence of missing data, the GEE approach provides extra fexibility and improved accuracy as compared to the crude adjustment. The closed-form formulas allow one to theoretically derive the condition under which the GEE approach is superior to the crude adjustment, in terms of the relative effciency measured by the sample size ratio. Bias-corrected variance estimators can be used to address the issue of infated type I error when the number of clusters per group is small. The sample size estimate and statistical power depend on intraclass correlations, missing pattern, and missing proportion. Therefore, it is recommended that the estimates for intraclass correlations and missing proportion be reported in published matched-pair CRTs, in order to aid in planning future trials. When information concerning the appropriate correlation parameters, missing patterns, and missing proportions is absent, a sensitivity analysis should be conducted for sample size and power by varying values of design parameters within a plausible range. The sample size formulas are derived under MCAR. In practice, missingness could be related to observed or unobserved characteristics of individuals or clusters in the trial. When the missing data mechanism is informative, model-based methods, such as weighted GEE under missing at random [30], or simulation studies under specifc missing data mechanisms could be used to determine the sample size. Since the missing data mechanism is often uncertain, sensitivity analyses should be conducted to formally assess the robustness of plausible missing data mechanisms. A practical strategy is to frst estimate the sample size under MCAR and then conduct sensitivity analyses to evaluate its performance under various non-MCAR missing mechanisms. If the sample size estimate is highly sensitive to the departure from MCAR, custom-designed simulation studies should be conducted for specifc trials. Equal cluster size is assumed in the GEE sample size approach discussed in this chapter, as this assumption simplifes the derivation of the rather complicated sample size formulas. Unequal cluster sizes occur commonly in real-world pragmatic CRTs. Many researchers have investigated on how to incorporate varying cluster size into sample size estimation by using the coeffcient of variation of cluster size in different types of cluster randomized design [31–33]. A similar approach can be used to handle varying cluster size in matched-pair cluster randomized design with and without missing outcomes. Moreover, the working independence assumption is employed here, because when only limited information about the actual correlation structure is available, the working independence correlation is often used due to its computational convenience. Nevertheless, there are limitations in the working independence assumption: (i) the effciency loss by using a working independence correlation structure instead of the true one; and (ii) intraclass correlations are considered as nuisance parameters and cannot be estimated in the GEE analysis with a working independence correlation structure. However, the extension to a general working correlation structure requires

90

Design and Analysis of Pragmatic Trials

a specifcation of joint missing probabilities, which would make sample size formulas even more complicated, and using a sophisticated working correlation structure with estimation of nuisance parameters might also cause convergence issues in the estimation of the parameters of interest. Pre–post cluster experimental design is a popular type of design for pragmatic clinical trials, which includes the baseline measurement of the outcome of interest [7]. In such a design, each cluster serves as its own control and the pre–post intervention effect is evaluated. Depending on how the population is sampled, this design can be classifed as pre–post cluster cohort design where the same individuals are measured at each time point, and pre–post cluster cross-sectional design where different individuals are measured at baseline and follow-up. The GEE sample size methods for matched-pair cluster randomized design presented in this chapter can be applied to pre–post cohort cluster experimental design or pre–post cross-sectional cluster experimental design. Another variation is crossover cluster randomized design, where each cluster will receive both the intervention and control treatments with the order of receiving treatments in a random way. Crossover can be implemented at the cluster or individual level. Giraudeau et al. [34], Rietbergen and Moerbeek [35], Forbes et al. [36], and Li et al. [37] developed sample size estimations for crossover cluster randomized design, based on the GEE or generalized linear mixed model approach. It is also possible to extend the sample size methods for matched-pair cluster randomized design to crossover cluster randomized design. This chapter considers matched-pair cluster randomized design with continuous and binary outcomes. CRTs in medical and public health research often have count data as the outcome of interest, for example, the number of clinic visits in a study of respiratory diseases or the number of cigarettes per day in a smoking cessation study. Sample size estimations for CRTs with count outcomes have been proposed based on Poisson regression or negative binomial regression model [16, 38–40]. One interesting future direction is to develop the GEE sample size approach for matched-pair cluster randomized design with missing count outcomes. In this chapter, we focus on matched-pair cluster randomized design. Covariate-constrained randomization and stratifcation are other ways to promote covariate balance in CRTs. Covariate-constrained randomization involves selecting the cluster randomization scheme from a smaller candidate set of randomizations where covariates of interest meet a certain level of balance by treatment group in terms of a specifc balance metric, instead of from all possible randomizations. Various balance metrics have been proposed for the implementation of covariate-constrained randomization procedures in CRTs [41–44]. For stratifed cluster randomized design, clusters are categorized into strata according to a stratifcation factor or combinations of multiple stratifcation factors. Stratifcation factors typically include important prognostic factors associated with the outcome of interest, for example, cluster size, geographic location, and socio-economic status. Randomization is subsequently performed within each stratum. Chapter 4

Matched-Pair Cluster Randomized Design for Pragmatic Studies

91

discusses sample size estimation for stratifed cluster randomized design for pragmatic studies.

Appendix A.1 Eigenvalues of the Correlation Matrix Rc Find the eigenvalues l for Rc by|Rc - lI2m |= 0 , where æ (1 - r1 - l)Im + r1 Jm Rc - lI2m = ç è ( r2 - r3 )Im + r3 Jm

( r2 - r3 )Im + r3 Jm ö ÷. (1 - r1 - l)) Im + r1 Jm ø

A We know if A and B are square matrix, we have B Therefore,

B = A+ B A- B . A

Rc - lI2m = (1 - r1 - l + r2 - r3 )Im + ( r1 + r3 )Jm ´ (1 - r1 - l - r2 + r3 )Im + ( r1 - r3 )Jm . In Theorem 8.4.4 by Graybill [45], for the k ´ k exchangeable matrix C = (a - b)I + bJ , the determinant is given by C = ( a - b)k -1[a + (k - 1)b]. So we have Rc - lI2m = (1 - r1 - l + r2 - r3 )m-1 [1 - l + r2 + (m - 1)( r1 + r3 )] ´(1 - r1 - l - r2 + r3 )m-1 [1 - l - r2 + (m - 1)( r1 - r3 )] . Hence, the correlation matrix Rc has four distinct eigenvalues, l1 = 1 - r1 + r2 - r3 , l 2 = 1 + (m - 1)r1 + r2 + (m - 1)r3 , l 3 = 1 - r1 - r2 + r3 , l 4 = 1 + (m - 1)r1 - r2 - (m - 1)r3 .

A.2 Proof of Theorem 1

( )

Under MCAR, we separate An bˆ into two parts for the control and intervention as

92

Design and Analysis of Pragmatic Trials

( )

An bˆ

=

1 n

1 = n 1 = n

2

n

m

åååD

æ 1 × pijk bˆ 1 - pijk bˆ ç è Xijk

ijk

æ 1 × pijk ( bˆ ) 1 - pijk ( bˆ ) ç è Xijk

j=1 i=1 k=1 2

n

m

åååD j=1 i=1 k=1 n

( ))

( )(

ijk

(

Xijjk ö 2 ÷ + o p (1) Xijk ø

)

m

æ1 D1jk × p1jk (1 - p1jk ) ç è0 k=1

åå j=1

Xijk ö 2 ÷ Xijk ø

0ö 1 ÷+ 0ø n

n

m

ååD j=1 k=1

2 jk

æ1 × p2 jk (1 - p2 jk ) ç è1

1ö ÷ + op (1). 1ø

( )

As n ® ¥, An bˆ converges to æ m [ s1P1Q1 + s2P2Q2 ] A=ç ms2P2Q2 è

ms2P2Q2 ö ÷, ms2P2Q2 ø

æ s2P2Q2 1 ç ms1s2P1P2Q1Q2 è -s2P2Q2

-s2P2Q2 ö ÷. s1P1Q1 + s2P2Q2 ø

and A -1 =

( )

And then, Vn bˆ can be written as

( )

Vn bˆ

1 = n =

=

1 n

1 n

n

æ ç ç è

2

ö D ijk eˆ ijk Xijk ÷ ÷ k=1 ø m

å åå j=1 n

i=1

ìï m æ D1jk e1jk + D 2 jk e 2 jk ö üï í ç ÷ý D 2 jk e 2 jk ø þï îï k=1 è

åå j=1

n

Ä2

m

æ ç ç m ç ç ç k’ =1 ç ç ç è

ååå j=1 k=1

Ä2

+ op (1)

(D 2 jk e 2 jk )(D 2 jk¢e 2 jk¢ ) +(D1jk e1jk )(D1jk¢e1jk¢ ) +(D1jk e1jk )(D 2 jk¢e 2 jk¢ ) +(D 2 jk e 2 jk )(D1jk¢e1jk¢ ) (D 2 jk e 2 jk )(D 2 jk¢e 2 jk¢ ) +(D1jk e1jk )(D 2 jk¢e 2 jk¢ ) +

ö ÷ ( 2 jk¢e 2 jk¢ ) ÷ (D 2 jk e 2 jk )(D +(D 2 jk e 2 jk )(D1jk¢e1jk¢ ) ÷ ÷ + op (1), ÷ ÷ (D 2 jk e 2 jk )(D 2 jk¢e 2 jk¢ ) ÷÷ ø

Matched-Pair Cluster Randomized Design for Pragmatic Studies

93

( )

As n ® ¥, Vn bˆ converges to æ C1 + 2C2 + C3 V =ç è C1 + C2

C1 + C2 ö ÷, C1 ø

where C1

= ms2P2Q2 + m(m - 1)(s22 + t1s2t2 )r1P2Q2 ,

C2 C3

= m(s1s2 + t2 s1s2t1t2 )r2 P1P2Q1Q2 + m(m - 1)(s1s2 + t3 s1s2t1t2 )r3 P1P2Q1Q2 , = ms1P1Q1 + m(m m - 1)(s12 + t1s1t1 )r1P1Q1.

Therefore, the (2, 2)th element of å = A-1VA-1 is

s 2*2 =

-

1 + (m - 1)(s1 + t1t1 )r1 1 + (m - 1)(s2 + t1t2 )r1 + ms1P1Q1 ms2P2Q2

2 P1P2Q1Q2 ëé(s1s2 + t2 s1s2t1t2 )r2 + (m - 1)(s1s2 + t3 s1s2t1t2 )r3 ùû . ms1s2P1P2Q1Q2

References 1. B. Roozenbeek, A. Maas, H. Lingsma, I. Butcher, J. Lu, A. Marmarou; IMPACT Study Group, et al. Baseline characteristics and statistical power in randomized controlled trials: Selection, prognostic targeting, or covariate adjustment. Critical Care Medicine, 37(10):2683–2690, 2009. 2. A. Leon, H. Demirtas, C. Li, and D. Hedeker. Subject-level matching for imbalance in cluster randomized trials with a small number of clusters. Pharmaceutical Statistics, 12(5):268–274, 2013. 3. N. Ivers, I. Halperin, J. Barnsley, J.M. Grimshaw, B.R. Shah, K. Tu, et al. Allocation techniques for balance at baseline in cluster randomized trials: A methodological review. Trials, 13(1):1–9, 2012. 4. E. Kestler, D. Walker, A. Bonvecchio, S. Tejada, and A. Donner. A matched pair cluster randomized implementation trial to measure the effectiveness of an intervention package aiming to decrease perinatal mortality and increase institution-based obstetric care among indigenous women in Guatemala: Study protocol. BMC Pregnancy and Childbirth, 13(1):1–9, 2013.

94

Design and Analysis of Pragmatic Trials

5. K. Imai, G. King, and C. Nall. The essential role of pair matching in clusterrandomized experiments, with application to the Mexican universal health insurance evaluation. Statistical Science, 24(1):29–53, 2009. 6. J. Hill, and M. Scott. Comment: The essential role of pair matching. Statistical Science, 24(1):54–58, 2009. 7. C. Rutterford, A. Copas, and S. Eldridge. Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44(3):1051– 1067, 2015. 8. J.S. Preisser, M.L. Young, D.J. Zaccaro, and M. Wolfson. An integrated populationaveraged approach to the design, analysis and sample size determination of cluster-unit trials. Statistics in Medicine, 22(8):1235–1254, 2003. 9. N.A. Obuchowski. On the comparison of correlated proportions for clustered data. Statistics in Medicine, 17(13):1495–1507, 1998. 10. M.H. Fiero, S. Huang, E. Oren, and M.L. Bell. Statistical analysis and handling of missing data in cluster randomized trials: A systematic review. Trials, 17(1):72, 2016. 11. E.L. Turner, F. Li, J.A. Gallis, M. Prague, and D.M. Murray. Review of recent methodological developments in group-randomized trials: Part 1-design. American Journal of Public Health, 107(6):907–915, 2017. 12. M. Taljaard, A. Donner, and N. Klar. Accounting for expected attrition in the planning of community intervention trials. Statistics in Medicine, 26(13):2615– 2628, 2007. 13. D.A. Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358):320–338, 1977. 14. A. Roy, D. Bhaumik, S. Aryal, and R. Gibbons. Sample size determination for hierarchical longitudinal designs with differential attrition rates. Biometrics, 63(3):699–707, 2007. 15. L. Mwandigha, K. Fraser, A. Racine-Poon, M. Mouksassi, and A. Ghani. Power calculations for cluster randomized trials (CRTs) with right-truncated Poissondistributed outcomes: A motivating example from a malaria vector control trial. International Journal of Epidemiology, 49(3):954–962, 2020. 16. F. Li, and G. Tong. Sample size and power considerations for cluster randomized trials with count outcomes subject to right truncation. Biometrical Journal, 63(5), 1052–1071, 2021. 17. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for matched-pair cluster randomization design with incomplete observations of continuous outcomes. Contemporary Clinical Trials, 104:106336, 2021. 18. K.Y. Liang, and S.L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22, 1986. 19. D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976. 20. G. Yin. Clinical Trial Design: Bayesian and Frequentist Adaptive Methods. John Wiley & Sons, 2012. 21. D.M. Murray. Design and Analysis of Group-Randomized Trials. Oxford University Press, 1998. 22. R.M. Turner, R.Z. Omar, and S.G. Thompson. Bayesian methods of analysis for cluster randomized trials with binary outcome data. Statistics in Medicine, 20(3):453–472, 2001.

Matched-Pair Cluster Randomized Design for Pragmatic Studies

95

23. P. Li, and D.T. Redden. Small sample performance of bias-corrected sandwich estimators for cluster-randomized trials with binary outcomes. Statistics in Medicine, 24(2):281–296, 2015. 24. J. Preisser, B. Lu, and B. Qaqish. Finite sample adjustments in estimating equations and covariance estimators for intracluster correlations. Statistics in Medicine, 27(27):5764–5785, 2008. 25. G. Kauermann, and R.J. Carroll. A note on the effciency of sandwich covariance matrix estimation. Journal of the American Statistical Association, 96(456):1387– 1396, 2001. 26. L.A. Mancl, and T.A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57(1):126–134, 2001. 27. S. Andrade, C. Lachat, A. Ochoa-Aviles, R. Verstraeten, L. Huybregts, D. Roberfroid, et al. A school-based intervention improves physical ftness in Ecuadorian adolescents: A cluster-randomized controlled trial. International Journal of Behavioral Nutrition and Physical Activity, 11(1):153, 2014. 28. X. Xu, H. Zhu, A.Q. Hoang, and C. Ahn. Sample size considerations for matched-pair cluster randomization design with incomplete observations of binary outcomes. Statistics in Medicine, 2021. https://doi.org/10.1002/sim.9131. 29. J. Ma, L. Thabane, J. Kaczorowski, L. Chambers, L. Dolovich, T. Karwalajtys, et al. Comparison of Bayesian and classical methods in the analysis of cluster randomized controlled trials with a binary outcome: The community hypertension assessment trial (CHAT). BMC Medical Research Methodology, 9:37–46, 2009. 30. J.M. Robins, A. Rotnitzky, and L.P. Zhao. Analysis of semiparametric regressionmodels for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429):106–121, 1995. 31. A. Manatunga, M. Hudgens, and S. Chen. Sample size estimation in cluster randomized studies with varying cluster size. Biomerical Journal, 43(1):75–86, 2001. 32. S.H. Kang, C. Ahn, and S.H. Jung. Sample size calculation for dichotomous outcomes in cluster randomization trials with varying cluster size. Drug Information Journal, 37(1):109–114, 2003. 33. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for stratifed cluster randomization design with binary outcomes and varying cluster size. Statistics in Medicine, 38(18):3395–3404, 2019. 34. B. Giraudeau, P. Ravaud, and A. Donner. Sample size calculation for cluster randomized cross-over trials. Statistics in Medicine, 27(27):5578–5585, 2008. 35. C. Rietbergen, and M. Moerbeek. The design of cluster randomized crossover trials. Journal of Educational and Behavioral Statistics, 36(4):472–490, 2011. 36. A.B. Forbes, M. Akram, D. Pilcher, J. Cooper, and R. Bellomo. Cluster randomised crossover trials with binary data unbalanced cluster sizes: Application to studies of near-universal interventions in intensive care. Clinical Trials, 12(1):34– 44, 2015. 37. F. Li, A.B. Forbes, E.L. Turner, and J.S. Preisser. Power and sample size requirements for GEE analyses of cluster randomized crossover trials. Statistics in Medicine, 38(4):636–649, 2019. 38. A. Amatya, D. Bhaumik, and R. D. Gibbons. Sample size determination for clustered count data. Statistics in Medicine, 32(24):4162–4179, 2013.

96

Design and Analysis of Pragmatic Trials

39. D. Li, S. Zhang, and J. Cao. Sample size calculation for clinical trials with correlated count measurements based on the negative binomial distribution. Statistics in Medicine, 38(28):5413–5427, 2019. 40. J. Wang, S. Zhang, and C. Ahn. Sample size calculation for count outcomes in cluster randomization trials with varying cluster sizes. Communications in Statistics-Theory and Methods, 49(1):116–124, 2020. 41. B.R. Carter, and K. Hood. Balance algorithm for cluster randomized trials. BMC Medical Research Methodology, 8:65, 2008. 42. E. De Hoop, S. Teerenstra, B.G. van Gaal, M. Moerbeek, and G. Borm. The “best balance” allocation led to optimal balance in cluster-controlled trials. Journal of Clinical Epidemiology, 65(2):132–137, 2012. 43. F. Li, Y. Lokhnygina, D.M. Murray, P.J. Heagerty, and E.R. DeLong. An evaluation of constrained randomization for the design and analysis of grouprandomized trials. Statistics in Medicine, 35(10):1565–1579, 2016. 44. E.L. Chaussee, L.M. Dickinson, and D.L. Fairclough. Evaluation of a covariateconstrained randomization procedure in stepped wedge cluster randomized trials. Contemporary Clinical Trials, 105:106409, 2021. 45. F.A. Graybill. Matrices with Applications in Statistics. Brooks/Cole, 1983.

4 Stratifed Cluster Randomized Design for Pragmatic Studies

4.1 Introduction of Stratified Cluster Randomized Design A fundamental design issue of cluster randomized trials (CRTs) is that a simple randomized CRT is often not as effective in reducing imbalance of baseline prognostic factors, threatening the internal validity of the trial. To address this issue, matched-pair and stratifed randomized designs are frequently implemented in CRTs. Matched-pair randomized design can guarantee balance on selected confounders by matching on them, provide tightly balanced treatment groups, and thus can have higher effciency in estimation than stratifed randomized design. However, problems with this design include the diffculty in obtaining close matches sometimes, diffculty in proper estimation of intracluster correlation coeffcient (ICC) due to the variation in outcome between clusters being confounded with the intervention effect [1], and complication in testing individual-level risk factors [2]. Klar and Donner suggested that stratifed randomized design may be a better design choice because matching decreases the degrees of freedom [3] and may lead to a loss of statistical power when matching is not effective. Turner et al. [4] noted that stratifcation with strata of size four would provide virtually all the advantages of pair matching while avoiding the problems, and stratifcation may be preferred over pair matching for that reason. Stratifed cluster randomized design is essentially a replication of a simple cluster randomized design within two or more strata. It uses stratifcation to increase the balance in baseline covariates and may improve statistical power in CRTs. Stratifed cluster randomized design can also be described as a randomized block design or a multi-site cluster randomized design with treatment at the cluster level, where the terminology of block or site is used instead of stratum. Stratifcation factors adopted often include cluster size, cluster-level socio-economic status, geographic location, and categorized levels of prognostic factors. One special type of stratifed cluster randomized design is size-stratifed cluster randomized design. Cluster size is often associated with important DOI: 10.1201/9781003126010-4

97

98

Design and Analysis of Pragmatic Trials

cluster-level factors and has been recognized as a surrogate for within-cluster dynamics that predicts outcomes in CRTs [5]. It is possible that in some cases the relevant cluster-level factor won’t be identifed or recorded or it is hard to measure accurately. For example, if the cluster size is associated with cluster-level socio-economic status, stratifying by cluster size might be more feasible than by a constructed socio-economic index. Moreover, since standard randomization procedure may fail to achieve balance in cluster size between treatment groups especially when the number of clusters is small to moderate, it is common to adopt size-stratifed cluster randomized design, where clusters are frst categorized into strata based on the size (e.g., small, medium, and large) and then randomized into treatment groups within each stratum. For continuous outcomes, Lewsey [6] conducted a simulation study to evaluate the gain in statistical power of size-stratifed cluster randomized design over simple cluster randomized design, where cluster size is associated with an important cluster-level factor which is predictive of the outcome but unaccounted for in the data analysis. The results demonstrated that if the association is strong enough and the number of randomized clusters is too few to achieve balance in the stratifcation factor (i.e., cluster size) between treatment groups, stratifed cluster randomized design will have greater power than simple cluster randomized design and the superiority is related to the number of clusters. Sample size estimation and power analysis for stratifed cluster randomized design have been investigated, by using the linear mixed effect model or generalized estimating equation (GEE) approach for continuous outcomes [7, 8] and the Cochran–Mantel–Haenszel statistics or the GEE approach for binary outcomes [9–11]. In this chapter, we discuss sample size estimations for stratifed cluster randomized design for pragmatic studies. Design considerations for pragmatic stratifed CRTs are discussed in Section 4.2. Sample size estimations for continuous and binary outcomes are presented in Sections 4.3 and 4.4, respectively. Section 4.5 includes further readings.

4.2 Considerations for Pragmatic Stratified CRTs Similar to simple randomized CRTs, intracluster correlation coeffcient (ICC) needs to be properly accounted for in the design of stratifed CRTs. Moreover, since stratifcation is implemented in the design phase and stratifcation factors are often associated with outcome variables, a sample size estimation based on such stratifcation is necessary and more appropriate than an estimation ignoring the stratifcation. Although stratifcation in CRTs is generally expected to lead to a sample size reduction when the stratifcation variable is predictive of the outcome [6, 9, 12, 13], it is important to investigate whether and when stratifcation will have a practically relevant impact on

Stratifed Cluster Randomized Design for Pragmatic Studies

99

the sample size reduction for CRTs. For the two-stratum CRTs with binary outcomes, Kennedy-Shaffer and Hughes [11] found that stratifcation in CRTs with small cluster sizes is unlikely to achieve substantial sample size reduction when there is a low overall probability of events. When the overall probability of events is moderate or high and the stratifcation variable is highly predictive of the outcome, stratifcation can have a substantial impact even for CRTs with small cluster sizes. Also as the mean cluster size increases, substantial sample size reduction can be achieved with stratifcation, including in situations where there is a low overall probability of events. Another important design consideration is that clusters are often naturally formed with random sizes in real-world pragmatic CRTs and cluster size can vary greatly with the trial setting. Specifcally for size-stratifed CRTs, it is rare to be able to construct strata consisting of equal-sized clusters in practice, and a more realistic scenario is that each stratum contains clusters with similar sizes. Thus, it would be more appropriate to characterize cluster sizes within a stratum through a distribution (with specifc mean and variance) instead of a single value [14]. Varying cluster size has important implications on statistical effciency. For simple randomized CRTs, Guittet et al. [15] and Carter [16] examined the impact of varying cluster size on statistical power using simulations, which demonstrated the greatest power reduction with few groups or high ICC or both. In the presence of varying cluster sizes, one commonly used ad hoc approach is to ignore the variability in cluster size and replace unequal cluster sizes by the average cluster size within each stratum in a sample size formula developed under equal cluster size [17] (average cluster size approximation). One may also replace unequal cluster sizes by the harmonic mean of cluster sizes within each stratum [18] (harmonic mean cluster size approximation) or by the smallest anticipated cluster size within each stratum (minimum cluster size approximation). Numerical studies by Xu et al. [10] suggested that, for stratifed CRTs with binary outcomes, when the intracluster correlation coeffcient (ICC) is positive, the average (minimum) cluster size approximation would underestimate (overestimate) the actual required number of clusters for each group per stratum and lead to underpowered (overpowered) clinical trials, while the harmonic mean cluster size approximation would depend on the distribution of cluster sizes through its harmonic mean. For a simple cluster randomized design, Manatunga et al. [17] developed a method to incorporate the variability in cluster size into sample size estimation for continuous outcomes and Kang et al. [19] derived sample size estimation for binary outcomes. The coeffcient of variation (CV) of cluster size, defned as the ratio of the standard deviation over the mean, is used to adjust for varying cluster size in the sample size formulas. This idea was extended to handle varying cluster size in sample size estimation for stratifed cluster randomized design with continuous and binary outcomes in Wang et al. [8] and Xu et al. [10], respectively. Alternative adjustment methods for handling varying cluster size are developed based on the relative effciency. For example, for CRTs with continuous outcomes,

100

Design and Analysis of Pragmatic Trials

van Breukelen et al. [20] suggested dividing the sample size estimated under equal cluster size by the relative effciency for the expected cluster size distribution, which is a function of the ICC and the mean and variance of the cluster size.

4.3 Stratified Cluster Randomized Design with Continuous Outcomes In this section, we discuss stratifed cluster randomized design with continuous outcomes. Moerbeek and Teerenstra [7] provided a sample size method for multi-site CRTs with fxed cluster size. Konstantopoulos [21], Schochet [22], and Bloom and Spybrook [23] among others discussed power analysis for multi-site CRTs. In this section, we present closed-form sample size formulas for size-stratifed CRTs with continuous outcomes based on the GEE approach and under two assumptions: equal or varying cluster size within each stratum [8]. 4.3.1 Sample Size Estimation Based on GEE Approach Suppose that in a CRT the clusters are categorized into L strata based on their sizes. The number of clusters within each stratum is denoted by J l ( l = 1, ˜ , L ). Assume that the cluster sizes within the lth stratum, denoted by nlj ( j = 1, ˜ , J l ), are independent samples from a certain discrete distribution with mean q l and variance tl2 . The total sample size of a stratifed CRT is n =

å å L

Jl

l=1

j=1

nlj , which depends on the number of strata (L), the number

of clusters in each stratum ( J l ), and cluster sizes ( nlj ). Let Ylji denote the continuous outcome measured on the ith ( i = 1, ˜ , nlj ) subject from the jth cluster of the lth stratum, and xlj = 0 /1 indicates that all subjects in the (l , j) th cluster are assigned to the control/treatment arm. The randomization probability r º P(xlj = 1) is used to generally accommodate balanced ( r = 0.5 ) and unbalanced randomization. The sample size estimation is based on the generalized estimating equation (GEE) approach [24], which only requires specifying models for the frst two moments of Ylji . The mean model is specifed as E(Ylji ) = b 0 + b1xlj , where b 0 is the intercept representing the baseline mean response under control, and b1 represents the treatment effect. The null hypothesis of interest is H 0 : b1 = 0. The covariance of Ylj = (Ylj1 , ˜ , Yljnlj )¢ , defned as å lj =Var(Ylj ) , is assumed to be an nlj ´ nlj matrix with the diagonal elements being s 2 and

101

Stratifed Cluster Randomized Design for Pragmatic Studies

off-diagonal elements being rs 2. Here s 2 is the variance of random error and r is the ICC quantifying the similarity of subjects within the same cluster. The observations are assumed to be independent across clusters. Defne b = ( b 0 , b1 )¢ , Xlj = 1xlj , and Zlj = (1, Xlj ) . Here 1 denotes a vector with all elements being 1. Then we have E(Ylj ) = Zlj b . Using an independent working correlation structure, it is easy to show that the GEE estimator of b is æ bˆ = ç ç è

ö Zlj’Zlj ÷ ÷ j=1 ø Jl

L

åå l=1

-1

Jl

L

ååZ ’Y . lj

lj

l=1 j=1

According to Liang and Zeger [24], as n ® ¥, n( bˆ - b ) approximately follows a normal distribution with mean 0 and variance V, which is consistently estimated by æ Vˆ = n ç ç è

ö Zlj’Zlj ÷ ÷ j=1 ø Jl

L

åå l=1

-1

æ ç ç è

öæ Zlj’eˆlj eˆlj’Zljj ÷ç ÷ç j=1 øè Jl

L

åå l=1

L

åå l=1

-1

ö Zlj’Zlj ÷ . ÷ j=1 ø Jl

Here eˆlj = Ylj - Zlj bˆ is the residual term. We reject H 0 : b1 = 0 if ˆ a is the level of n |bˆ1 |/sˆ 1 > z1-a/2 , where sˆ 12 is the (2,2)th element of V, two-sided type I error, and z1-a/2 is the 100(1 - a / 2) th percentile of the standard normal distribution. After some algebra, it can be shown that s 12 =lim n®¥sˆ 12 = n(v1 + v2 ) with

v1 =

v2 =

s2

s2

å å x éën + n ( n - 1) r ùû , æ x n ö÷ ç èå å ø L

Jl

l=1

j=1

lj

lj

L

Jl

l=1

j=1

lj

lj

2

lj lj

å å (1 - x ) éën + n ( n - 1) r ùû . æ (1 - x )n ö÷ ç ø èå å L

Jl

l=1

j=1

lj

L

lj

lj

2

Jl

l=1

j=1

lj

lj

lj

Using the fact that E(nlj ) = q l and Var (nlj ) = tl2 , for large J l ,

v1 »

s2

å

L

{ é êëå

l=1

}

rJ l E éë nlj + nlj ( nlj - 1) r ùû L l=1

rJ lE ( nlj ) ùú û

2

102

Design and Analysis of Pragmatic Trials

s2 =

å

L l=1

é1 ù J lq l2 ê ( 1 - r ) + 1 + xl2 r ú q û. ë l 2 L ö æ rç J lq l ÷ l=1 è ø

(

)

å

Here xl = tl /q l is the coeffcient of variation (CV) within stratum l. Similarly, v2 can be approximated by

s2 v2 »

å

L

é1 ù J lq l2 ê ( 1 - r ) + 1 + xl2 r ú q ë l û. 2 L æ ö (1 - r) ç J lq l ÷ l=1 ø è

(

l =1

)

å

It is obvious that v1 = v2 under balanced randomization ( r = 0.5 ). The power analysis of a stratifed CRT requires the specifcation of the following parameters: the type I error a , true treatment effect b1 , randomization probability r, variance s 2 , ICC r , the number of strata L, the numbers of clusters within strata J = {J l : l = 1,˜, L} , the stratum-specifc mean cluster sizes q = {q l : l = 1, ˜ , L} , and the stratum-specifc CVs x = {xl : l = 1,˜, L} . The testing power under sample size n =

å å L

Jl

l=1

j=1

nlj can be evaluated by

æ |b1 | ö Fç - z1-a/2 ÷ , ø è v1 + v2

(4.1)

where F(×) is the standard normal cumulative distribution function. This power analysis approach (4.1) is fexible to accommodate a wide spectrum of pragmatic trial scenarios, including unbalanced randomization (through r), different numbers of clusters across strata (through J = {J l : l = 1,˜, L} ), arbitrary averaged cluster sizes within strata (through q = {q l : l = 1, ˜, L} ), and different variability in cluster size within each stratum (through x = {xl : l = 1,˜, L} . Sample size estimation is more complicated because the total sample size n =

å å L

Jl

l=1

j=1

nlj »

å

L

J lq l involves many variables: L and

l=1

{(J l , q l ) : l = 1, ˜, J } . Considering a relatively simpler scenario of balanced randomization ( r = 0.5 ) and an equal number of clusters across strata ( J1 = ˜ = J L = J ) would help understand the impact of various parameters on sample size requirement in a size-stratifed CRT. The required number of clusters per stratum (J) can be solved by setting expression (4.1) to be the target power 1 - g , given L, q = {q l : l = 1,˜, L} , and other parameters:

4(z1-a/2 + z1-g ) s J= b12 2

2

å ´

L l=1

é1 ù q l2 ê ( 1 - r ) + 1 + xl2 r ú ë ql û. 2 L æ q l ö÷ ç l=11 è ø

(

å

)

(4.2)

103

Stratifed Cluster Randomized Design for Pragmatic Studies

é1 ù When L = 1, the second term in (4.2) simplifes to ê ( 1 - r ) + 1 + xl2 r ú , ë ql û which is identical to the correction term derived by Manatunga et al. [17] for simple randomized CRTs with continuous outcomes and varying cluster sizes. When there is no variability in cluster size xl = 0 for l = 1, ˜ , L , and hence nlj = q l , the number of clusters per stratum required is

(

4(z1-a/2 + z1-g ) s J* = b12 2

2

å ´

L l=1

)

é1 ù q l2 ê ( 1 - r ) + r ú q ë l û. 2 L ö æ ql ÷ ç l=1 è ø

å

4.3.2 Relative Sample Size Change Due to Varying Cluster Size Assuming balanced randomization ( r = 0.5 ) and an equal number of clusters across strata, the relative change in sample size due to varying cluster size R is defned as J R = * -1= J

r

å

L

q l2xl2

. é1 ù q ê (1 - r ) + r ú l=1 ë ql û In pragmatic CRTs, ICCs are typically positive ( 0 < r < 1 ). Under this assumption, R is always positive. That is, variability in cluster size always leads to increased sample size requirement. The common practice of using the average cluster size in each stratum for sample size estimation would lead to underpowered clinical trials. For the extreme case of r = 0 , a CRT is equivalent to an individually randomized trial. The variability in cluster size has no impact on sample size ( R = 0 ). As an illustration, Figure 4.1 presents a plot of the relative change in sample size R vs ICC ( r ), in a simple scenario where the numbers of clusters and CVs are equal across strata. Without loss of generality, the common CV is set as x = 1. Two sets of q are explored: q 1 = (10, 30, 50) and q 2 = (20, 30, 40) . The fgure shows that the curve corresponding to q 1 is always above that corresponding to q 2 , because q 1 has a greater between-stratum variability in cluster size than q 2 although their overall mean cluster sizes are the same. It is noteworthy that the assumed value of x only affects the range of the vertical axis. The shape and relative position of the two curves remain unchanged.

å

L

l=1

2 l

4.3.3 Example Investigators are interested in assessing the effect of using an information technology (IT)-based novel technology platform on the well-being of patients with a triad of chronic kidney disease, diabetes, and hypertension. The size-stratifed cluster randomized design will be employed. Patients are

104

Design and Analysis of Pragmatic Trials

1.0

0.8

R

0.6

0.4

0.2

1

0.0

2

0.0

0.2

0.4

0.6

0.8

= (10, 30, 50) = (20, 30, 40) 1.0

FIGURE 4.1 A numerical study to explore the impact of ICC ( r ) on relative change in sample size (R). Assuming the numbers of clusters and CVs to be equal across strata. The common CV is set at x = 1 . The means of cluster sizes are set as q : q 1 = (10, 30, 50) and q 2 = (20, 30, 40) .

clustered by clinics, which are stratifed by the size of clinic and randomly allocated at a 1:1 ratio to either the IT-based intervention group or a control group (standard medical care) within each stratum. Here, clinics are stratifed into three groups of small, medium, and large based on the size of the clinic. Based on the assignment of his or her clinic, each patient will be assigned to either the intervention group or a control group. The primary outcome of well-being will be measured using the instrument of Bradely et al. [25] after three-month intervention. We estimate the sample size based on the comparison of well-being scores between control and IT-based intervention groups. From preliminary data, the mean of the well-being score at 3-month intervention was 30 with a standard deviation of 12 across all three strata in the control group. We expect that the mean well-being score in the IT-based intervention group will be 10% higher than that in the standard medical care group. We assumed an equal standard deviation between two intervention groups. From a preliminary dataset, the ICC ( r ) is estimated as 0.03. To be conservative, a r of 0.05 is assumed in the sample size calculation. The average cluster sizes (i.e., the average numbers of patients in a clinic) are 5, 17, and 65 in small, medium, and large strata, respectively, with variances of 6, 25, and 500 in small, medium, and large strata. With an equal number of 30 clinics for small, medium, and large strata, the power of the study becomes 90.13% at a two-sided 5% signifcance level. When the number of clinics is

105

Stratifed Cluster Randomized Design for Pragmatic Studies

unequal with 40, 30, and 20 clinics in small, medium, and large strata, the power of the study becomes 84.32%.

4.4 Stratified Cluster Randomized Design with Binary Outcomes In this section, we discuss stratifed cluster randomized design with binary outcomes. Donner [9] developed a sample size method for designs of stratifed CRTs, where the cluster size itself may be a stratifying factor. This approach generalizes the sample size formula developed by Woolson et al. [26] to stratifed categorical data under simple randomization, and it assumes equal cluster size within each stratum but different across strata. Kennedy-Shaffer and Hughes [11] provided an analytic formula for the sample size estimation for stratifed CRTs and the ratio of required sample sizes in stratifed vs simple randomized CRTs with binary outcomes, based on the GEE approach. While their formula is quite complicated and relies on a correct specifcation of within-stratum design effects, especially with varying cluster size within each stratum. In this section, we present closedform sample size formulas for stratifed CRTs with binary outcomes, based on the Cochran–Mantel–Haenszel (CMH) statistics [27] and under equal or varying cluster size within each stratum [10]. 4.4.1 Sample Size Estimation Based on CMH Statistic Suppose that in a CRT, clusters are grouped into K strata and are randomly assigned to either the intervention or control group. For simplicity, we consider a balanced randomization, with equal number of clusters assigned to each group. Let J k denote the number of clusters for each group for stratum k, k = 1,..., K . Let n1jk and n2 jk denote the cluster size for cluster j in stratum k for the intervention and control group, respectively, j = 1,..., J k , k = 1,..., K . We assume that n1jk ’s and n2 jk ’s are independently and identically distributed ( i.i.d. ) samples with mean mk and variance t2k , k = 1,..., K . The total number of subjects of a stratifed CRT is N =

å

K

nk =

k=1

å

K

(n1k + n2k ) =

k=1

å å K

Jk

k=1

j=1

(n1jk + n2 jk ) , where nk is the

number of subjects required for stratum k, n1k =

å

Jk

n1jk and n2k =

j=1

å

Jk

n2 jk

j=1

are the number of subjects for stratum k for the intervention group and control group, respectively. Let Yijk1 and Yijk2 be the binary outcome variable (1 for response, 0 for no response) of subject i in cluster j of the stratum k for the intervention and control group, respectively. Let p1k and p2k be the response probability in stratum k for the intervention group and the

106

Design and Analysis of Pragmatic Trials

control group, respectively, k = 1,..., K . Let dk = p1k - p2k denote the difference in the estimated response probability between intervention and conJk n1jk Jk trol groups in stratum k, where p1k = çæ Yijk1 )/( n1jk ö÷ , and j=1 i=1 j=1 è ø Jk n2 jk Jk p2k = çæ Yijk2 )/( n2 jk ö÷ . Let r be ICC quantifying the similarj=1 i=1 j=1 è ø ity of subjects within the same cluster. The odds ratio between intervention and response in stratum k is q k = [p1k (1 - p2k )]/[p2k (1 - p1k )], k = 1,..., K . Let q denote the common odds ratio. The interest is to estimate the sample size J k , k = 1,..., K , for testing H 0 : q = 1 vs H a : q ¹ 1 with a power of 1 - b at a two-sided signifcant level of a . Donner [9] generalized the sample size formula of Woolson et al. [26] for stratifed categorical data under simple randomization and proposed the sample size method for stratifed cluster randomized design with binary outcomes. Donner’s method assumes the cluster size to be equal within, but different across strata. It accounts for clustering in a stratifed cluster randomized design, but not the variability in cluster size. Xu et al. [10] provided the sample size formula for stratifed CRTs with varying cluster size. The Cochran– Mantel–Haenszel statistic for testing the hypotheses: H 0 : q = 1 vs H a : q ¹ 1 is

å å

å å

å

å

å wd , å w var(d ) K

C0 =

k k

k=1

K

k=1

2 k

k

where wk = n1k n2k /(n1k + n2k ) and var(dk ) = var( p1k ) + var( p2k ) . The large sample distribution of C0 converges to a standard normal distribution under the null hypothesis H 0 : q = 1 . More generally, the statistic

Ca =

å

K k=1

wk éë(p1k - p2k ) - (p1k - p2k )ùû

å

K k=1

wk2var(dk )

converges to a standard normal distribution. The power for testing H 0 : q = 1 vs H a : q ¹ 1 is Pr(|C0 |> Z1-a/2 |q ¹ 1) , where Z1-a/2 is the 100(1 - a / 2)th percentile of the standard normal distribution. The estimation of testing power and sample size requires the asymptotic variance of dk = p1k - p2k . Since Yijk1 ’s are i.i.d. random samples with mean p1k and variance p1k (1 - p1k ) , and n1jk ’s are i.i.d. random samples with mean m k and variance t2k ,

å

p1k (1 - p1k ) var( p1k ) =

Jk j=1

å

n1jk + 2 æ ç è

å

Jk j=1

n1jk ö÷ j=1 ø

Jk

æ n1jk öp1k (1 - p1k )r ç ÷ è 2 ø

2

107

Stratifed Cluster Randomized Design for Pragmatic Studies

=

å n +n æ ö çå n ÷ ø è Jk

p1k (1 - p1k )

j=1

1jk

r - n1jk r )

2

Jk

j=1

2 1jk

.

1jk

As J k ® ¥ , var( p1k ) ® p1k (1 - p1k )

=

E(n1jk ) + r E(n12jk ) - r E(n1jk ) J k E2 (n1 jk )

p1k (1 - p1k ) éë rm k + rg k 2 mk + (1 - r )ùû , J kmk

where g k = tk / m k is the coeffcient of variation (CV). Let Mk = rmk + rg k 2 mk + (1 - r ) , then p1k has the variance of Mk p1k (1 - p1k )/(J k m k ), as J k ® ¥ , where Mk is considered as a correction factor. Similarly, var( p2k ) = Mk p2 k (1 - p2 k )/(J k m k ) , as J k ® ¥ . Therefore, as J k ® ¥ , the variance of dk under H 0 can be expressed as æ 1 1 var0 (dk ) = p k (1 - p k ) ç + J m J km k è k k where p k =

ö ÷ Mk , ø

p1k + p2k . The asymptotic variance of dk under H a is 2 é p (1 - p1k ) p2k (1 - p2k ) ù vara (dk ) = ê 1k + ú Mk . J kmk J kmk ë û

Let tk denote the fraction of subjects in the trial belonging to stratum k, where

åt K

1

k

= 1 . Then the sample size can be obtained by solving the fol-

lowing two equations: K

Z1-a /2 =

å k=1 K

å k=1

K

-Z1- b =

å k=1 K

Ntk p (1 - p k )Mk 4 k

Ntk (p1k - p2k ) 4

å k=1

Ntk (p1k - p2k ) 4

K

å k=1

,

Ntk (p 1k - p 2k ) 4

Ntk é p1k (1 - p1k ) + p2k (1 - p2k )ùû Mk 8 ë

(4.3)

.

(4.4)

108

Design and Analysis of Pragmatic Trials

Based on Equations (4.3) and (4.4), the total number of subjects N required for stratifed cluster randomized design is given by

( Z1-a /2T + Z1- b U ) N=

2

(4.5)

V2

where 1

K üï 2 1 ïì T = í tk éë rmk + rg k 2 mk + (1 - r )ûù p k (1 - p k )ý , 2 îï k=1 ïþ

å

ìï 1 U=í îï 8

1

üï 2 tk éë rmk + rg k 2 mk + (1 - r )ùû [p 1k (1 - p 1k ) + p 2k (1 - p 2k )]ý , ïþ k=1 K

å

V=

1 4

K

åt (p k

- p 2k ), p k = (p 1k + p 2k )/ 2.

1k

k=1

The estimated total number of subjects N depends on the mean and variance of the distribution of cluster size, in addition to the expected group difference, ICC, and type I and type II errors. Considering a relatively simple case of equal number of clusters in each stratum ( tk = mk /

å

K

mk

k=1

and J1 = J 2 = ... = J K = J ), the number of clusters per group per stratum is

å

J = N /(2

K

mk ) .

k=1

When there is no variability in cluster size ( g k = 0 for k = 1,..., K ), formula (4.5) reduces to N

*

(Z =

T * + Z1- b U *

1-a /2

)

V *2

2

,

(4.6)

where 1

K ù2 1é T * = ê tk ( rmk + 1 - r )p k (1 - p k )ú , 2 êë k=1 úû

å

ìï 1 U* = í îï 8

1

üï 2 tk ( rmk + 1 - r ) [p 1k (1 - p 1k ) + p 2k (1 - p 2k )]ý , ïþ k=1 K

å

V* =

1 4

K

åt (p k

k=1

1k

- p 2k ),

Stratifed Cluster Randomized Design for Pragmatic Studies

109

which is consistent with the formula derived by Donner [9]. The corresponding number of clusters per group in stratum k is J k* = N *tk /(2mk ) . Assuming an equal number of clusters in each stratum, the number of clusters per group

å

per stratum is J * = N * /(2

K

mk ) . Here, J * is also the required number of

k=1

clusters per group per stratum by the average cluster size approximation. Formula (4.6) can be modifed to calculate the required number of clusters per group per stratum by the harmonic mean cluster size approximation and by the minimum cluster size approximation, replacing mk by the harmonic mean of cluster sizes and the smallest anticipated cluster size, respectively. It is noted that the formulas for J * and J have similar forms, except for the additional term involving rg k 2 mk that accounts for the variability in cluster size. 4.4.2 Relative Sample Size Change Due to Varying Cluster Size Assuming equal number of clusters per group across strata, let R be the relative change in sample size (i.e., the number of clusters per group per stratum) due to varying cluster size, R = J / J * - 1 , where J and J * are the sample size under varying and equal cluster sizes, respectively. When ICC is positive, R is positive, indicating that the variability in cluster size leads to an increased sample size, and the average cluster size approximation would underestimate the sample size. Whereas when ICC is negative, the average cluster size approximation would overestimate the sample size. Although cases of negative ICC are uncommon, Hanley et al. [28] discussed such cases which involved the birth weights of human twins and lung sizes of animal littermates. In these cases, with limited space or nutrition, nature allows considerable inequality among individual “competitors”, therefore a negative within-twin or within-litter correlation. Xu et al. [10] numerically examined the impact of ICC and CV on the relative change in sample size, under a simple scenario where both numbers of clusters and CVs are equal across strata ( g 1 = ... = g k = g ). Figure 4.2 presents a plot of the relative change in sample size R vs ICC ( r ) , under different combinations of (g , m ) . The common CVs are set as g = 0.5 , 0.75, and 1, and the means of cluster sizes m are set as: m1 = (10, 30, 50) and m2 = (20, 30, 40) . The fgure shows that (i) the relative change in sample size R increases from 0 to g 2 , as ICC increase from 0 to 1; (ii) R increases as CV increases; (iii) R is greater for m1 , comparing with m2 . This is because that m1 has a higher between-stratum variability in cluster size than m2 , although their overall mean cluster sizes are the same. 4.4.3 Estimation of Clustering Parameter Sample size for stratifed cluster randomized design depends on the value of ICC, which is not known in many cases. ICC ( r ) can be estimated by the analysis of variance (ANOVA) method in Donald and Donner [29]. Ridout et

110

Design and Analysis of Pragmatic Trials

1.0000

1.0 =1

0.9 0.8 0.7

R

0.6

0.5625 = 0.75

0.5 0.4 0.3

0.2500

0.2

= 0.5

0.1 1 2

0.0 0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

= (10, 30, 50) = (20, 30, 40)

0.8

0.9

1.0

FIGURE 4.2 A numerical study to explore the impact of ICC ( r ) and CV ( g ) on relative change in sample size (R) . Assuming the numbers of clusters and CVs to be equal across strata. The common CVs are set as g = 0.5 , 0.75, and 1. The means of cluster sizes are set as m : m1 = (10, 30, 50) and m2 = (20, 30, 40) .

al. [30] evaluated the performance of various estimators of r for clustered binary data through simulation, and they showed that the ANOVA estimator performed well. ICC ( r ) can be estimated using observed data from a stratifed CRT as follows. Assuming that r is common to all clusters. One can estimate r1k and r2k in the intervention and control groups for stratum k ( k = 1,..., K ), respectively, and use the mean of these 2k estimates as the overall estimate of r . For the intervention group in stratum k, r1k can be estimated by MSB - MSW , MSB + (n0 - 1)MSW

rˆ1k =

where the between-group mean square (MSB) and within-group mean square (MSW) are

å

a 2jk j=1 n1jk

Jk

MSB = æ ç ç ak ç MSW = è

å

å

2 k Jk

a

j=1

n1jk

Jk - 1 ak2

Jk j=1

n1 jk

ö æ ÷ ç ÷-ç ÷ ç ø è

å

å

Jk

j=1

a 2jk j=1 n1jk

Jk

(n1jk - 1)

,

å

ak2 Jk j=1

n1jk

ö ÷ ÷ ÷ ø,

Stratifed Cluster Randomized Design for Pragmatic Studies

n1jk

a jk =

åY

1 ijk

J k n1jk

, ak =

i=1

ååY

1 ijk

111

,

j=1 i=1

and n0k is the adjusted mean cluster size Jk

n0k = nk -

å

Jk

(n1jk - nk )2

j=1

ån

(J k - 1)

1jk

, nk =

Jk

ån j=1

Jk

.

1jk

j=1

Similarly, r2k for the control group in stratum k can be estimated. Then, r is estimated by

å rˆ =

K k=1

rˆ1k +

å

2K

K k=1

rˆ2k

.

There are several methods for obtaining confdence intervals for the ICC estimates. Bonett [31] presented a formula for constructing a confdence interval for the ANOVA estimator of ICC, based on F-distribution and can be implemented using statistical software PASS. Turner et al. [32] presented Bayesian methods to construct confdence intervals. Ionan et al. [33] evaluated three different methods for constructing confdence intervals: generalized confdence interval method, modifed large sample method, and Bayesian method, and discussed how to implement these methods using statistical software programs. 4.4.4 Example Investigators plan to conduct a randomized pragmatic clinical trial of the management of patients with a triad of chronic kidney disease (CKD), diabetes, and hypertension, in order to determine whether a new intervention (a clinical support model enhanced by technology support) can reduce oneyear unplanned all-cause hospitalizations comparing with the standard medical care. A prospective stratifed cluster randomized design will be employed. Patients are clustered by clinics, which are stratifed by healthcare systems. The clinics are randomly allocated with equal probability to either the intervention or the standard medical care group within each stratum. We calculate the sample size based on the comparison of one-year unplanned all-cause hospitalization rates between the intervention group and the control group. From preliminary data, we observe that the rate of unplanned allcause hospitalization during the one-year follow-up period is 14% across all four large healthcare systems in the standard medical care group. We expect that the hospitalization rate in the intervention group will be 3% lower than that in the standard medical care group, corresponding to a common odds

112

Design and Analysis of Pragmatic Trials

ratio q = 0.76 (intervention vs standard care). Electronic health records (EHR) show that the numbers of patients with coexistent CKD, hypertension, and diabetes are 4,419, 4,738, 4,175, and 1,093 in the four large healthcare systems. The fractions of patients belonging to each healthcare system are t1 = 0.306 , t2 = 0.329 , t3 = 0.289 , and t4 = 0.076 with the total number of patients equal to 14,425. The numbers of clinics are 25, 40, 50, and 9 in the four large healthcare systems. The average number of patients for each clinic mk is 177, 119, 84, and 122, with standard deviation tk of 75, 53, 36, and 58, respectively. In the sample size calculation, we assume that the fractions of patients from each healthcare system in the trial remain the same as those in the EHR. The preliminary data suggests that ICC is estimated as rˆ = 0.008 (95% confdence interval: 0.002 - 0.012 , calculated by the ANOVA method). If we assume a conservative estimate, rˆ = 0.015 , to detect a 3% difference in the rate of unplanned all-cause hospitalization, 11, 17, 22, and 4 clinics for each group from the four large healthcare systems (a total of 12,387 patients) are needed to achieve 80% power at a two-sided 5% signifcance level by the proposed method. Here, 12,387 patients are 85.9% of 14,425 patients from the four healthcare systems participating in the study. By the average cluster size approximation, we would need to enroll 10, 16, 19, and 4 clinics for each group from the four large healthcare systems, corresponding to a total of 10,991 patients. Note that the required numbers of clinics for each group from the fourth healthcare system by the proposed method and average cluster size approximation are the same and equal to 4, although the required numbers of patients for each group from the fourth healthcare system by the two methods differ (469 patients and 416 patients, respectively), due to the issue of rounding up numbers of clinics in the calculation. Both the number of clinics for each group from the four large healthcare systems and the total number of patients required by the average cluster size approximation are smaller than those by the proposed method, which is likely to lead to an underpowered study without accounting for the variability in the size of the clinic. Further, if stratifcation is ignored, the number of clinics for each group needed is 45, corresponding to a total of 11,428 patients, by the sample size method proposed by Kang et al. [19] that would only account for varying cluster size. This total number of patients is also smaller than that of the proposed method accounting for both stratifcation and varying cluster size.

4.5 Further Readings In this chapter, we have discussed sample size estimations for stratifed cluster randomized design for pragmatic studies. In real-world CRTs, clusters are often formed with random sizes and such variability must

Stratifed Cluster Randomized Design for Pragmatic Studies

113

be appropriately accounted for in the sample size estimation. We present closed-form sample size formulas for stratifed cluster randomized design with continuous outcomes based on the GEE approach and that with binary outcomes based on the CMH statistic, under equal or varying cluster size within each stratum. The sample size methods can be used generally for stratifed CRTs and multi-site CRTs, where the stratifying factor can be cluster size, geographic location, categorized levels of prognostic factors, or study site. The methods are fexible to accommodate arbitrary numbers of clusters within each stratum. The variability in cluster size is incorporated into sample size estimation by using the CV of cluster size, which only requires information about the frst two moments instead of the specifc distribution of cluster size. The closed-form formulas provide insight into the impact of various design parameters (mean difference or odds ratio, ICC, mean and CV of cluster size, etc.) on the sample size. The relative change in sample size due to varying cluster size is theoretically derived, and the relative change in sample size increases as ICC and CV increase. For stratifed cluster randomized design with binary outcomes and varying cluster size, numerical studies by Xu et al. [10] have shown that when ICC is positive (negative), the average cluster size approximation would underestimate (overestimate) the required number of clusters per group per stratum. Whereas the minimum cluster size approximation would overestimate (underestimate) the required number of clusters per group per stratum, therefore, a good estimate of ICC is essential. The harmonic mean cluster size approximation has some uncertainty and would depend on the specifc design confguration. The sample size methods presented in this chapter are developed based on the asymptotic approximation, which might not be satisfactory when the number of clusters per group per stratum is small. Numerical studies have demonstrated that the nominal power is preserved when the number of clusters per group per stratum is as small as 12 [8, 10]. The sample size formula for continuous outcomes is derived under the independent working correlation structure, which greatly simplifes computation and allows one to obtain a closed-form sample size formula, and the parameter estimators remain consistent [34, 35]. It has been shown that the estimators under the independent working correlation structure are highly effcient compared with those under the true correlation structure [24]. The slight loss in effciency due to not using the true correlation structure means that the sample size formula for continuous outcomes provides a conservative estimation of sample size. This chapter considers sample size estimations for two-arm stratifed CRTs. Pragmatic clinical trials with multiple intervention arms are becoming increasingly popular, where participants are randomized to one of several intervention groups. For example, there may be two or more experimental intervention groups with a common control group, or two comparator control groups such as a placebo group and a standard treatment group. Evaluating more than one intervention concurrently increases the chances of fnding an

114

Design and Analysis of Pragmatic Trials

effective intervention [36]. Multi-arm pragmatic trials are an attractive way of optimizing resources and simultaneously testing various intervention strategies. Although many authors have studied methods for the design and analysis of multi-arm individually randomized trials and multi-arm CRTs [37–40], multi-arm stratifed CRTs can be more complex in their design and analysis. Complications of such trials are directly related to the number of intervention arms and the number of possible comparisons [41], in addition to stratifcation and within-cluster correlation. Future research is warranted to develop sample-size methods for designing multi-arm stratifed CRTs. Missing data are common in pragmatic clinical trials due to reasons such as loss to followups, missing visits, or patient dropouts, and may attenuate statistical power if not appropriately adjusted. A crude adjustment of missing data in sample size estimation is to infate the sample size determined under the assumption of no missing data by the expected missing proportion. As noted in Chapter 3, such adjustment may fail to incorporate the impact of missing data on sample size and power, which largely depends on the missing pattern and correlation within the cluster. One interesting future direction is to incorporate missing data into sample size consideration for stratifed cluster randomized design. This chapter has focused on stratifed cluster randomized design with continuous and binary outcomes. Primary endpoints in pragmatic trials can also be categorical, count, or survival outcomes or longitudinal measurements. Future methodology development is needed to accommodate these outcomes in stratifed cluster randomized design. Despite the advantages of stratifcation in reducing baseline covariate imbalance and enhancing comparability between clusters in different treatment groups, stratifcation may have limitations in CRTs when there are multiple important baseline covariates to balance at the design phase because of the small number of clusters in most trials [42]. Stratifcation may create too many sparse strata with incomplete fllings, and the optimal number of strata is often diffcult to determine and would depend on the specifc context of a trial. To overcome challenges in balancing on multiple and possibly continuous covariates, covariate-constrained randomization has been developed for CRTs [43–47]. This approach involves selecting the cluster randomization scheme not from the pool of all possible randomizations, but from a smaller candidate set of randomizations where baseline covariates achieve suffcient balance according to a pre-specifed balance metric. Various balance metrics for using in the covariate-constrained randomization procedures in CRTs have been proposed [46, 47]. Interested readers can fnd more details about this approach in the aforementioned articles. As an alternative to parallel-arm cluster randomized design, stepped-wedge cluster randomized design is a one-directional crossover cluster randomized design, where all clusters receive both the control and intervention treatments [48]. At the initial period, all clusters receive the control treatment. At sequential time periods (a.k.a., steps), individual clusters or groups of clusters cross over to the intervention treatment based on a randomly assigned cluster order.

Stratifed Cluster Randomized Design for Pragmatic Studies

115

All clusters will receive the intervention at the end of the trial. Chapter 5 will discuss sample size estimations for stepped-wedge cluster randomized design for pragmatic studies.

References 1. A. Donner, and N. Klar. Pitfalls of and controversies in cluster randomized trials. American Journal of Public Health, 94(3):416–422, 2004. 2. A. Donner, M. Taljaard, and N. Klar. The merits of breaking the matches: A cautionary tale. Statistics in Medicine, 26(9):2036–2051, 2007. 3. N. Klar, and A. Donner. The merits of matching in community intervention trials: A cautionary tale. Statistics in Medicine, 16(15):1753–1764, 1997. 4. E.L. Turner, F. Li, J.A. Gallis, M. Prague, and D.M. Murray. Review of recent methodological developments in group-randomized trials: Part1-design. American Journal of Public Health, 107(6):907–915, 2017. 5. A. Donner, and N. Klar. Design and Analysis of Cluster Randomization Trials in Health Research. Arnold, 2000. 6. J.D. Lewsey. Comparing completely and stratifed randomized designs in cluster randomized trials when the stratifying factor is cluster size: A simulation study. Statistics in Medicine, 23(6):897–905, 2004. 7. M. Moerbeek, and S. Teerenstra. Power Analysis of Trials with Multilevel Data. Chapman and Hall/CRC, 2015. 8. J. Wang, S. Zhang, and C. Ahn. Power analysis for stratifed cluster randomisation trials with cluster size being the stratifying factor. Statistical Theory and Related Fields, 1(1):121–127, 2017. 9. A. Donner. Sample size requirements for stratifed cluster randomization designs. Statistics in Medicine, 11(6):743–750, 1992. 10. X. Xu, H. Zhu, and C. Ahn. Sample size considerations for stratifed cluster randomization design with binary outcomes and varying cluster size. Statistics in Medicine, 38(18):3395–3404, 2019. 11. L. Kennedy-Shaffer, and M.D. Hughes. Sample size estimation for stratifed individual and cluster randomized trials with binary outcomes. Statistics in Medicine, 39(10):1489–1513, 2020. 12. C. Rutterford, A. Copas, and S. Eldridge. Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44(3):1051– 1067, 2015. 13. S.J. Pocock, S.E. Assmann, L.E. Enos, and L.E. Kasten. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: Current practice and problems. Statistics in Medicine, 21(19):2917–2930, 2002. 14. S.A. Lauer, K.P. Kleinman, and N.G. Reich. The effect of cluster size variability on statistical power in cluster randomized trials. Plos One, 10(4):e0119074, 2015. 15. L. Guittet, P. Ravaud, and B. Giraudeau. Planning a cluster randomized trial with unequal cluster sizes: Practical issues involving continuous outcomes. BMC Medical Research Methodology, 6:17, 2006.

116

Design and Analysis of Pragmatic Trials

16. B. Carter. Cluster size variability and imbalance in cluster randomized controlled trials. Statistics in Medicine, 29(29):2984–2993, 2010. 17. A.K. Manatunga, M.G. Hudgens, and S. Chen. Sample size estimation in cluster randomized studies with varying cluster size. Biometrical Journal, 43(1):75–86, 2001. 18. S.W. Raudenbush. Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2):173–185, 1997. 19. S.H. Kang, C. Ahn, and S.H. Jung. Sample size calculation for dichotomous outcomes in cluster randomization trials with varying cluster size. Drug Information Journal, 37(1):109–114, 2003. 20. G.J. van Breukelen, M.J. Candel, and M.P. Berger. Relative effciency of unequal versus equal cluster sizes in cluster randomized and multicentre trials. Statistics in Medicine, 26(13):2589–2603, 2007. 21. S. Konstantopoulos. Computing power of tests of the variance of treatment effects in designs with two levels of nesting. Multivariate Behavioral Research, 43(2):327–352, 2008. 22. P.Z. Schochet. Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33(1):62–87, 2008. 23. H.S. Bloom, and J. Spybrook. Assessing the precision of multisite trials for estimating the parameters of a cross-site population distribution of program effects. Journal of Research on Educational Effectiveness, 10(4):877–902, 2017. 24. K.Y. Liang, and S.L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22, 1986. 25. C. Bradley (Ed.). Handbook of Psychology and Diabetes: A Guide to Psychological Measurement in Diabetes Research and Management. Hardwood Academic, 1994. 26. R.F. Woolson, J.A. Bean, and P.B. Rojas. Sample size for case-control studies using Cochran’s statistic. Biometrics, 42(4):927–932, 1986. 27. W.G. Cochran. Some methods for strengthening the common χ 2 tests. Biometrics, 10(4):417–451,1954. 28. J.A. Hanley, A. Negassa, and M.D.D. Edwardes. GEE analysis of negatively correlated binary responses: A caution. Statistics in Medicine, 19(5):715–722, 2000. 29. A. Donald, and A. Donner. Adjustments to the Mantel–Haenszel chi-square statistic and odds ratio variance estimator when the data are clustered. Statistics in Medicine, 6(4):491–499, 1987. 30. M.S. Ridout, C.G.B. Demetrio, and D. Firth. Estimating intraclass correlation for binary data. Biometrics, 55(1):137–148, 1999. 31. D.G. Bonett. Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine, 21(9):1331–1335, 2002. 32. R.M. Turner, R.Z. Omar, and S.G. Thompson. Bayesian methods of analysis for cluster randomized trials with binary outcome data. Statistics in Medicine, 20(3):453–472, 2001. 33. A.C. Ionan, M.Y.C. Polley, L.M. McShane, and K.K. Dobbin. Comparison of confdence interval methods for an intra-class correlation coeffcient (ICC). BMC Medical Research Methodology, 14(1):121, 2014. 34. M. Crowder. On the use of a working correlation matrix in using generalised linear models for repeated measures. Biometrika, 82(2):407–410, 1995. 35. B.W. McDonald. Estimating logistic regression parameters for bivariate binary data. Journal of the Royal Statistical Society, Series. B, 55(2):391–397, 1993.

Stratifed Cluster Randomized Design for Pragmatic Studies

117

36. M.K. Parmar, J. Carpenter, and M.R. Sydes. More multiarm randomised trials of superiority are needed. Lancet, 384(9940):283–284, 2014. 37. P. Liu, and S. Dahlberg. Design and analysis of multiarm clinical trials with survival endpoints. Controlled Clinical Trials, 16(2):119–130, 1995. 38. S.H. Jung, and C. Ahn. K-sample test and sample size calculation for comparing slopes in data with repeated measurements. Biometrical Journal, 46(5):554–564, 2004. 39. S. Zhang, and C. Ahn. Sample size calculation for comparing time-averaged responses in K-group repeated-measurement studies. Computational Statistics and Data Analysis, 58(1):283–291, 2013. 40. S. Watson, A. Girling, and K. Hemming. Design and analysis of three-arm parallel cluster randomized trials with small numbers of clusters. Statistics in Medicine, 40(5):1133–1146, 2021. 41. G. Baron, E. Perrodeau, I. Boutron, and P. Ravaud. Reporting of analyses from randomized controlled trials with multiple arms: A systematic review. BMC Medicine, 11(1):84, 2013. 42. N.M. Ivers, I.J. Halperin, J. Barnsley, J.M. Grimshaw, B.R. Shah, K. Tu, et al. Allocation techniques for balance at baseline in cluster randomized trials: A methodological review. Trials, 13:120, 2012. 43. G.M. Raab, and I. Butcher. Balance in cluster randomized trials. Statistics in Medicine, 20(3):351–365, 2001. 44. L.H. Moulton. Covariate-based constrained randomization of grouprandomized trials. Clinical Trials, 1(3):297–305, 2004. 45. B.R. Carter, and K. Hood. Balance algorithm for cluster randomized trials. BMC Medical Research Methodology, 8:65, 2008. 46. E. de Hoop, S. Teerenstra, B.G. van Gaal, M. Moerbeek, and G.F. Borm. The “best balance” allocation led to optimal balance in cluster-controlled trials. Journal of Clinical Epidemiology, 65(2):132–137, 2012. 47. F. Li, Y. Lokhnygina, D.M. Murray, P.J. Heagerty, and E.R. DeLong. An evaluation of constrained randomization for the design and analysis of grouprandomized trials. Statistics in Medicine, 35(10):1565–1579, 2016. 48. M.A. Hussey, and J.P. Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28(2):182–191, 2007.

5 The GEE Approach for SteppedWedge Trial Design

5.1 Introduction The current healthcare system in the United States has been criticized as being overly complex, expensive, and ineffcient [1], and pragmatic clinical trials (PCTs) have been considered a plausible solution that would support a transition to a system in which continual learning is possible [1–3]. The concept of PCTs dates back to 1967 when researchers started to ask “Does the intervention work when used in real-life practice condition?” [4]. PCTs usually require minimum exclusion criteria. They are normally used when there is a priori knowledge of the effcacy of the intervention. In addition, PCTs are often concerned with complex interventions and the control arms typically receive the standard care instead of placebo [5]. In summary, PCTs are designed as cost-effective, large-scale research studies to address issues that matter to patients, clinicians, and healthcare systems by comparing interventions in real-world populations [6]. The stepped-wedge cluster randomized trial design (henceforth denoted as the SW design) is particularly suitable to and has been frequently adopted by PCTs. The SW design is a special case of the cluster randomized trial design, with a unique feature that allows a controlled stepwise introduction of an intervention to a population [7, 8], making it suitable to examine the effectiveness of an intervention in real-life clinical practice. An SW trial involves multiple steps. At the initial step, all clusters start from the control treatment, and they are randomized to different sequences, each switching assigned clusters to the intervention at one of the subsequent steps. Outcomes are measured at every step. When an SW trial is completed, all clusters would have switched to the intervention. Figure 5.1 provides an illustration of the SW design. The horizontal axis indicates steps and the vertical axis indicates patient clusters (e.g., formed by physicians or clinics). In a randomly assigned order, the clusters gradually switch to intervention. The advantages of an SW trial include: (1) by switching all clusters to intervention at the end of study, it allows clinical researchers to avoid the ethical DOI: 10.1201/9781003126010-5

119

120

Design and Analysis of Pragmatic Trials

FIGURE 5.1 An illustration of an SW trial with T = 5 steps and S = 4 sequences. (Each cell represents a data collection point. Shaded and blank cells represent intervention and control, respectively.)

dilemma of withholding an effective treatment to the control arm, which has been encountered in traditional parallel trials. This feature also mitigates one of patients’ common concerns of being randomized to an inferior treatment when participating in a randomized trial. (2) It is more effcient because the evidence of effectiveness is provided by comparisons from two directions: vertical (cross-sectional) comparisons of clusters receiving intervention vs those receiving control at each step, and horizontal (longitudinal) comparisons of observations before and after switching to intervention within each cluster. (3) Logistically it is more manageable because clusters switch from control to intervention in a stepwise fashion, putting less stress on the operation of healthcare systems. This is especially true for trials with complicated interventions that involve IT and procedural changes. (4) Outcomes are measured multiple times over the steps, offering insights into the temporal trend of intervention effect. The disadvantages, however, include longer study duration, higher risk of attrition, and diffculty in blinding. SW trials present various challenges to biostatisticians. For example, the proportion of missing data is usually higher than that in traditional parallel trials because SW trials involve longitudinal follow-up over a longer study duration. The internal trend of intervention effect as well as natural disease progression should be accounted for in both design and analysis. Furthermore, the correlation structure is much more complicated than that in traditional designs. Besides intracluster correlation (ICC), the longitudinal correlation among observations across steps usually cannot be ignored. There are three major types of SW trials [9, 10]. The frst is called closedcohort where all participants are followed throughout the study hence longitudinal measurements are collected from the same cohort of individuals. Each participant receives both control and intervention. The second is crosssectional where new panels of patients are enrolled at every step so that measurements are obtained at discrete time points from distinct individuals. A participant only receives one treatment: control or intervention. The third is a mixture of the above two, called open-cohort. Specifcally, some patients are followed through the whole study, but the design also allows some patients

The GEE Approach for Stepped-Wedge Trial Design

121

to leave while others to join at interim steps. For example, in a review of SW trials conducted during 2010–2014 [10], among the 37 studies identifed, 11 were categorized as closed-cohort, 13 cross-sectional, and 11 open-cohort. A similar distribution was reported in a more recent review [11]. In this and the next chapters, we discuss sample size calculation for SW trial design in PCTs. A fundamental rule in sample size calculation is to align the model assumed in sample size calculation with that to be used in data analysis. Violation of this rule can lead to what Kimball [12] called the third kind of errors in statistical consulting. That is, getting the right answer to a wrong problem. For the analysis of clinical trials with correlated outcomes, two alternative approaches, the generalized estimating equation (GEE) and mixed-effect model approaches, have been most widely employed. Both can accommodate different types of outcome variables, including continuous, binary, and count variables. The mixed-effect model approach constructs analytic models based on likelihood functions from the exponential family. Dependence among observations is accounted for through the specifcation of random effects. Statistical inference is usually obtained using the maximum likelihood method or a variant called the restricted maximum likelihood. More theoretical details can be found in Laird and Ware [13] and Jennrich and Schlucter [14]. The GEE approach, however, does not require fully specifying the likelihood function of clustered or longitudinal observations [15]. It utilizes estimating equations that are shown to provide consistent estimators of regression parameters and their variances if the marginal mean model is correctly specifed. In other words, the GEE approach is robust against misspecifed “working” correlation. One important difference between the mixed-effect model and GEE approaches lies in the specifcation of correlation. To account for correlation, the former includes random effects into the linear predictor model, which is then associated with the mean through a link function. Under such a setup, the correlation structure modeled through random effects applies to the linear predictor, not to the observed outcomes. For example, in a mixed-effect logistic model for a binary outcome, suppose an AR(1) structure is assumed for the random effects. This AR(1) structure is actually specifed on the logit scale. The correlation structure of the original binary outcome does not have a closed form. The GEE approach, on the other hand, directly models the correlation structure of the outcome variable. Due to this difference, except for continuous outcomes, a GEE-based sample size method usually could not fnd a mixed-effect model based counterpart under exactly identical conditions, even if the assumptions seem the same (for example, both assuming the AR(1) structure). We will review sample size solutions for SW trials developed based on both GEE and mixed-effect model approaches, for cross-sectional, closedcohort, and open-cohort trials. Various design issues encountered in PCTs will be investigated, including different types of outcomes (continuous, binary, and count), unbalanced randomization, arbitrary patterns and probabilities of missing data, complicated correlation structures, and randomly

122

Design and Analysis of Pragmatic Trials

varying cluster sizes. Because of complicated correlation structures and the requirement for multiple measurements over a longer duration, appropriately addressing missing data in sample size calculations is particularly relevant to SW trials. Under certain special cases, closed-form sample size formulas can be derived, which are easy to use and allow theoretical evaluation of the impact of different design factors. By modifying the vector of indicators for treatment sequences, the same theoretical framework is readily applicable to other multi-period cluster randomized trials, such as longitudinal and crossover cluster randomized trials. Finally, the stepwise crossover to intervention in SW trials presents great potential for implementing various adaptive design methods. Adaptive designs enable researchers to perform analyses on accumulating data in mid-trial, based on which the trial can be stopped early in case of overwhelming effcacy or futility. Both Bayesian and frequentist adaptive strategies for SW trials are investigated.

5.2 A Brief Review of GEE In a longitudinal study, let yij be the outcome measurement obtained from the ith ( i = 1, ˜ , n ) subject at the jth ( j = 1, ˜ , m ) time point, and x ij = (xij1 , ˜ , xijp )¢ be the corresponding time-dependent covariates. Defne y i = (yi1 , ˜ , yim )¢ to be the subject-specifc vector of outcome measurements and assume y i to be independent across subjects. The estimating equations are constructed based on the frst two moments of y i . The model for the frst moment (expectation) is

mij = E( yij ) = g( x’ij b ),

(5.1)

where b = ( b1 , ˜ , b p )¢ contains the regression coeffcients, depending on the study design, quantifying the effect of treatments, temporal trend, demographic and clinical factors, etc. The linear predictor x ij’b is associated with the mean through the link function g(×) . With different specifcations of g(×) , the GEE approach can accommodate different types of outcome variables, such as continuous, binary, and count. For the second moment, the variance of y i is modeled by 1

1

Wi = y Si2 R(x)Si2 .

(5.2)

Here y denotes the dispersion parameter, Si =diag {s( mi1 ), ˜ , s( mim )} is a diagonal matrix with a known function s(×) , and Var ( yij ) = y s( mij ) . The matrix R(x) is known as the working correlation matrix, indicated by the parameter x . The main interest lies in the statistical inference about b . It has

The GEE Approach for Stepped-Wedge Trial Design

123

been shown that the large sample distribution of GEE estimator bˆ does not require knowing the true value of y . The GEE approach is fexible to accommodate different types of outcome variables. For example, by specifying g(×) to be the identity function, g( xij’b ) = xij’b , s(×) = 1 , and y to be the residual variance, we have a linear regression model for a continuous outcome. To perform logistic regression for a binary outcome, we can specify g(×) to be a logistic function, s( mij ) = mij (1 - mij ), and y = 1 . To perform Poisson regression for a count outcome, we can specify g(×) = log(×) , s(×) to be the identity function, and y = 1 . The GEE approach does not require full specifcation of the likelihood function. For statistical inference, only models for the frst two moments, Models (5.1) and (5.2), are needed. The GEE estimate bˆ is obtained as solution to the general estimating equation n

åD¢ W i

i

-1

[ yi - mi ( b )] = 0,

(5.3)

i=1

where mi ( b ) = ( mi1 , ˜ , mim )¢ and Di = ¶mi ( b )/ ¶b is an m ´ p gradient matrix. Equation (5.3) usually does not have a closed-form solution and numeric methods such as the Newton–Raphson algorithm [16] have been employed. One attractive feature of the GEE method is that, as long as the mean model, mij = g(xij’b ) , is correctly specifed, the estimated bˆ remains consistent even if the specifed working correlation R(x) deviates from the true correlation, denoted by R0 . To model dependence among observations from the same subjects, many different models can be specifed for R(x) . For example, the compound symmetry (CS) model assumes the correlation between observations to be constant regardless of their temporal distance. The AR(1) model assumes decaying correlation over time. Both have been widely used. Furthermore, taking advantage of the robustness of GEE methods against misspecifed working correlation, researchers have used independent working correlation. As an instrument to facilitate estimation, independent working correlation ( R(x) = Im , where Im is an m ´ m identity matrix) has been frequently employed in the analysis of dependent data based on GEE. It can simplify theoretical derivation and reduce computational burden [17]. Researchers have also demonstrated that the estimated bˆ under independent working correlation is highly effcient in many situations [15, 18]. In practice, the true correlation R0 is usually unknown. Hence, specifying an independent working correlation affords a reasonable solution to obtain consistent estimators. In addition, computation under independent working correlation is less intensive and the iterative estimating algorithms usually perform more

124

Design and Analysis of Pragmatic Trials

stably (converge faster). Correct specifcation of R(x) surely improves effciency. As a pragmatic solution, however, the independent working correlation has been frequently employed due to various considerations including lack of prior knowledge about the true correlation R0 , theoretical derivation, computational burden, and stability of estimation. It can be shown that as n ® ¥ , under mild regularity conditions, n( bˆ - b ) approximately follows a normal distribution with mean 0 and variance å n . The variance is computed by a robust sandwich variance estimator, å n ( bˆ ) = An-1( bˆ )Vn ( bˆ ) An-1( bˆ ) , with An-1( bˆ ) =

n

åD¢ (bˆ )W i

i

-1

( bˆ ) Di ( bˆ )

i=1

and Vn ( bˆ ) =

n

åD’ (bˆ )W i

i

-1

( bˆ )Wi* ( bˆ )Wi -1( bˆ ) Di ( bˆ ).

i=1

Here Wi* ( bˆ ) = eˆ Ä2 with eˆi = yi - mi ( bˆ ) being the vector of estimated residual i Ä2 errors and u = uu’ for a vector u . An-1( bˆ ) is the model-based variance estimator, which is valid under the correct specifcation of correlation, R(x ) = R0 . On the other hand, it can be shown that the robust variance estimator å n ( bˆ ) is identical to the model-based variance estimator An-1( bˆ ) under R(x ) = R0 . This sandwich estimator is justifed by a Taylor series expansion (or the Delta method) argument [19, 20]. Liang and Zeger [15] showed that the Taylor series justifcation can be applied to different correlation structures for outcomes within the exponential family. So far we have described the GEE approach in the context of longitudinal studies, where each subject contributes observations measured at multiple time points. It should be pointed out that the GEE approach is generally applicable to the analysis of dependent data, including longitudinal and clustered observations, as well as data with hierarchical correlation structures. For example, in a closed-cohort SW trial, the subjects are clustered, while each of them contributes multiple measurements over the steps. As a result, both trial design and data analysis need to appropriately account for this hierarchical correlation structure. Statistical inference regarding b usually takes the form of hypothesis testing, H 0 : Hb = h0 vs H1 : Hb ¹ h0 . Here H is a q ´ p matrix of rank q £ p and h0 is a conformable vector of constant terms. Recall that p is the number of covariates. The Wald test statistics [21, 22] has been widely used in statistical inference. Its basic idea is to assess constraints on model parameters based on the weighted distance between the unrestricted estimate and its

The GEE Approach for Stepped-Wedge Trial Design

125

hypothesized value under the null hypothesis, with the weight being precision of the estimate. Mathematically, it is computed by -1

W = n( H bˆ - h0 )¢ é H å n ( bˆ )H ù ( H bˆ - h0 )¢. ë û Asymptotically W follows a non-central chi-square distribution with q degree of freedom, denoted by c 2q (l n ) . The non-centrality parameter l n is calculated by l n » n(H b - h0 )¢[ H å 0 H]-1(Hb - h0 ), with å 0 being the limit of å n ( bˆ ) as n ® ¥. It is obvious that l n = 0 under the null hypothesis H 0 . That is, the null distribution of test statistics W is the central chi-square distribution, denoted by c 2q (0) . In hypothesis testing, at a signifcance level of a , we reject the null hypothesis H 0 if W > c1-a , where c1-a is the 100(1 - a) th percentile of the c 2q (0) distribution. Furthermore, the sample size required to achieve power 1 - g at the signifcance level a can be solved from the following equation:

ò

¥

c1-a

dF(q, ln ) = 1 - g .

Here F(q, ln ) denotes the cumulative distribution function of the non-central chi-square distribution c q2 (ln ) . We have presented the Wald test as testing jointly multiple ( q > 1 ) hypotheses on single/multiple parameters. In the special case of testing a single hypothesis ( q = 1 ), the Wald test statistics W is equivalent to a z statistics, z=

n(H bˆ - h0 ) , H å n ( bˆ )H

which follows the standard normal distribution under H 0 .

5.3 Design SW trials with a Continuous Outcome In the following, we present sample size calculation for SW trials within the context of pragmatic trials (PCTs). Based on the GEE approach, we can derive a closed-form sample size solution for SW trials with a continuous outcome. Through different specifcations of correlation structures, it can accommodate the closed-cohort or cross-sectional SW trials. We also demonstrate that this sample size approach is fexible to address pragmatic issues such

126

Design and Analysis of Pragmatic Trials

as missing data due to various mechanisms, complicated correlation structures, or arbitrary crossover schemes [23]. We frst consider a closed-cohort SW trial. Suppose this trial involves n clusters and T steps. Each cluster is randomly assigned to one of the S sequences ( S = T - 1 ). Let Yijt be the continuous response obtained at Step t ( t = 1, … , T ) from Subject j( j = 1,…, J ) of the ith cluster (i = 1,… , n) , where J is the number of subjects in each cluster. Thus, N = nJ is the total number of unique subjects. A linear model is assumed for Yij = ( Yij1 ,…, YijT ) : E ( Yij ) = l + uiz = Xi b ,

(5.4)

where l = ( l1 ,…, lT ) ’ contains time-specifc intercepts which imply a fex’ ible model for the temporal trend; ui = ( ui1 ,…, uiT ) represents the sequence of treatments received by Cluster i, with uit = 0 /1 denoting control/intervention at Step t; and z is a scalar parameter representing the intervention effect, which is assumed to be constant over time. We defne Xi = ( IT , ui ) to be the subject-specifc design matrix where IT denotes a T ´ T identity

(

matrix. Hence, b = l ’ , z

)



is the vector of regression coeffcients. The ran-

domization scheme is defned by {(vs , ps ), s = 1, ˜ , S} , with

and

å

S

v1

=

v2

= °

vS

=

( 0,1,1, … ,1,1) , ’ ( 0, 0,1, … ,1,1) , ’

(5.5)

( 0, 0, 0, … , 0,1) , ’

ps = 1. The defnition of vs indicates that all clusters start with the

s=1

control treatment, and the switchover to intervention occurs at Step (s + 1) . A cluster is randomly assigned to receive treatment sequence vs with probability ps. That is, P(ui = vs ) = ps . Under the above specifcation, the proportion of subjects receiving intervention at Step t is ut =

å

S

psvst . Here vst is the tth

s=1

element of vs . Accordingly, we defne vector u = ( u1 ,…, uT ) . It is noteworthy that S = T - 1 is the maximum number of unique treatment sequences under the stepwise transitioning scheme over T steps. In practice, due to various reasons, researchers might decide to skip treatment transition at certain steps. We can accommodate such scenarios by setting ps = 0 if no treatment transition is planned at the ( s + 1 )th step. We assume Var ( Yijt ) = s 2 . For each cluster, we use Corr ( Yij ) = W to denote the longitudinal (or within-subject) correlation and Corr ( Yij , Yij¢ ) = F for

j ¹ j¢ to denote the between-subject correlation within Cluster i. We assume the subjects to be independent across clusters. Hence, W and F fully specify the correlation structure of Yi = Yi1’ ,…,YiJ’ . Specifcally,

(

)

127

The GEE Approach for Stepped-Wedge Trial Design

Corr ( Yi ) = I J Ä ( W - F ) + ( 1 J 1’J ) Ä F F.

(5.6)

Here Ä indicates the operation of Kronecker product, I J denotes a J ´ J identity matrix, and 1 J is a vector of length J with all elements being 1. The primary interest is to test the null hypothesis H 0 : z = 0. Considering that, when conducting data analysis, the true correlation is usually unknown, we assume that researchers adopt a practical solution: estimating the model parameters based on GEE using an independent working correlation structure. The estimates remain consistent but the computation is much simpler and more stable in terms of convergence [17]. Furthermore, an independent working correlation has been successfully employed in sample size calculation based on GEE [24–27]. The GEE estimator of b is æ bˆ = ç ç è

ö J X Xi ÷ ÷ i=1 ø n

å

-1

’ i

æ ç ç è

J

n

ö

ååX Y ÷÷ø . ’ i ij

i=1 j=1

Under mild regularity conditions [28], as n ® ¥ , n( bˆ - b ) ® N ( 0,å å n ) in distribution. The covariance matrix å n can be consistently estimated by An-1 En An-1 , where n

An = n-1

åJ X X ’ i

i

i=1

and n

En = n

-1

æ ç ç è

åå i=1

Ä2

ö X eˆ ÷ . ÷ j=1 ø J

’ i ij

Here eˆij = Yij - Xi bˆ is the estimated residual error and cÄ2 = cc’ for a vector c . The null hypothesis H 0 : z = 0 is rejected if n |ˆ zˆ |/sˆ z > z1-a /2 , where zˆ is the (T + 1) th element of bˆ , sˆ z2 is the ( T + 1, T + 1) component of å n , and z1-a /2 is the 100 ( 1 - a / 2 ) th percentile of the standard normal distribution. What we have described so far is effectively a statistical analysis plan (SAP) for an SW trial using the GEE approach. In the following, we present how to conduct power analysis for this SAP. Suppose that under the alternative hypothesis H1 , the true intervention effect is z = z 0. To reject the null hypothesis H 0 : z = 0 with power 1 - g at a two-sided signifcance level a , the required number of clusters n can be solved from æ |zˆ | ö Pç > z1-a /2 |H1 ÷ = 1 - g . ç sˆ z / n ÷ è ø

(5.7)

128

Design and Analysis of Pragmatic Trials

Equation (5.7) can always be solved by numerical search. In order to obtain closed-form sample size solutions, we replace zˆ and sˆ z with their limits to facilitate derivation. It is straightforward that under H1 , zˆ ® z 0 as n ® ¥. Next, we present how to derive the limit of sˆ z , denoted by s z . Let A and E be the limits of An and En as n ® ¥ , then å n converges to å = A-1 EA-1 . Specifcally, we have S

A= J

åp W W ’ s

s

s

s=1

and S

E = s 2J

åp W éëW + ( J - 1) F ùû W , ’ s

s

s

s=1

where Ws = ( IT , vs ) . It can be shown that the (T + 1, T + 1) component of å is

s = 2 z

s2

å

S s=1

ps ( vs - u ) éëW + ( J - 1) F ûù ( vs - u ) ’

J éê ë

å

(

)

ut 1 - ut ùú t=1 û T

2

.

Plugging z and s z into (5.7), the estimated number of clusters is

( z1-a/2 + z1-g ) s 2 2

n=

S

ë + ( J - 1) F ûù ( v - u ) åp ( v - u ) éW ’

s

s

s

s=1

ù é T z 02 J ê ut 1 - ut ú êë t=1 ûú

å (

)

2

.

(5.8)

Details of the derivation are presented in Appendix A. We have a few observations about sample size (5.8). First, by introducing notations of ps and vs, sample size (5.8) provides great fexibility for researchers to account for arbitrary switchover schedules and unbalanced randomization encountered in real-world SW trials. Furthermore, it is noteworthy that time-specifc intercepts ( l ) are not involved in Equation (5.8). That is, the underlying temporal trend approximately has negligible impact on sample size requirement for SW trials with continuous outcomes. Because all clusters receive control at Step 1 and experimental intervention at Step T, it is obvious that the frst elements of vs ( s = 1, ˜ , S ) and u both take value 0, and the last elements both take value 1. In turn, the frst and last elements of (vs - u) equal 0 for s = 1, ˜ , S. From this observation, we have the following fact:

129

The GEE Approach for Stepped-Wedge Trial Design

Fact 1 Defne Yij* = (Yij2 , ˜ , Yij(T -1) )¢ , W * =Corr (Yij* ) , and F * =Corr(Yij* , Yij¢* ) , where j ¹ j¢ . The correlation matrices W and F affect sample size (5.8) only through sub-matrices W * and F * . Fact 1 is particularly meaningful for special scenarios where T = 3 or T = 4 . Under T = 3 , between-subject correlation F affects sample size only through Corr(Yij2 , Yij¢2 ) , while within-subject correlation W has no effect. Under T = 4 , within-subject correlation W affects sample size only through Corr(Yij2 , Yij3 ) . Hence, whether W has a CS or AR(1) correlation structure will lead to the same sample size as long as Corr(Yij2 , Yij3 ) remains the same. For reference, the expressions of sample size (5.8) under T = 3 and T = 4 are presented below: • When T=3, v2 - u = ( 0 -p1 é1 ê W = êa êëb

a 1 c

we have 0 ) . Without

bù é r11 ú ê c ú and F = ê r12 êë r13 1úû

v1 - u = ( 0 1 - p1 0 ) and loss of generality, suppose

r12 r22 r23

r13 ù r23 úú . Then the required r33 ûú

number of clusters is

=

n

=

( z1-a/2 + z1-g )

2

s2

å

S

ps ( vs - u ) éë W + ( J - 1) F Fùû ( vs - u ) s=1

z 02 J éê ë

( z1-a/2 + z1-g )

2



å

(

)

ut 1 - ut ùú t=11 û T

s 2 éë1 + ( J - 1) r22 ùû

z 02 Jp1 ( 1 - p1 )

2

.

Among all the elements of W and F , only r22 =Corr(Yij2 , Yij¢2 ) is involved in sample size calculation. • When T =4, we have v1 - u = ( 0 1 - p1 p3 v2 - u = ( 0 -p1 p3 0 ) ’ , and v3 - u = ( 0 -p1 p3 - 1 The required number of clusters is

n

=

=

( z1-a /2 + z1-g )

2

s2

å

S s=1

z 02 J éê ë

( z1-a /2 + z1-g )

2

0)’ , 0)’ .

ps ( vs - u ) éë W + ( J - 1) F ùû ( vs - u )

å



(

)

ut 1 - ut ùú t=11 û T

2

s 2 éë p1 ( 1 - p1 ) w11 + 2p1p3 w12 + p3 ( 1 - p3 ) w22 ùû z 02 J éë p1 ( 1 - p1 ) + p3 ( 1 - p3 ) ùû

2

,

130

Design and Analysis of Pragmatic Trials

é w11 where ê ë w12

w12 ù = W * + ( J - 1) F * with Yij* = (Yij2 , Yij3 )¢ , W * =Corr (Yij* ) , and w22 ûú

F * =Corr(Yij* , Yij*¢ ) , for j ¹ j¢ . In practice, many SW trials adopt a balanced randomization scheme. That 1 is, p1 = ˜ = pS = . The sample size (5.8) under balanced randomization can S be simplifed to 36 ( z1-a /2 + z1-g ) Ss 2 2

n=

S

Fùû ( v - u ) ë W + ( J - 1) F å ( v - u ) éW ’

s

s=1

(

z 02 J S2 - 1

s

)

2

.

(5.9)

The matrices W and F represent within-subject and between-subject correlation, respectively. Each can be arbitrarily specifed. This setup provides great fexibility to account for complicated correlation structures arising from SW design. Alternatively, the specifcation of W and F can be implied by the direct specifcation of Corr (Yi ) for the whole cluster. For example, a block-CS structure has been described by Srivastava et al. [29] where F = cW W with a scaling parameter 0 < c < 1 . This specifcation assumes the between-subject correlation to be proportional to within-subject (longitudinal) correlation. The within-subject correlation W can be of arbitrary structures, such as CS (assuming within-subject correlation to be constant over time) or AR(1) (assuming within-subject correlation to decay with temporal distance). Another approach to specifying W and F is motivated by linear mixedeffect models. For example, suppose the model includes cluster and patient random effects, yijt = E( yijt ) + x i + hij + e ijt , where E( yijt ) is defned in Model (5.4), x i and hij are zero-mean cluster and subject random effects with variances s x2 and s h2 , respectively, and e ijt is the residual error term with variance s e2 . Then we will have s 2 = s x2 + s h2 + s e2 , W is a compound symmetry (CS) correlation matrix with off-diagonal elements being r1 + r2 and F = r2 11¢ , with r1 = (s x2 + s h2 )/ s 2 and r2 = s h2 / s 2 . This structure has also been termed the nested exchangeable correlation structure. Under this setting, the sample size formula can be simplifed to 3 ( z1-a /2 + z1-g ) Ss 2 éë( S - 2 ) r1 + S ( J - 1) r2 + 2ùû 2

n=

(

z 02 J S2 - 1

)

.

The derivations of (5.9) and (5.10) are presented in Appendix B.

(5.10)

The GEE Approach for Stepped-Wedge Trial Design

131

It is noteworthy that sample size (5.10) is a linear function of r1 and r2 . Equation (5.10) also shows that the coeffcient of r1 is S - 2 , which is much smaller than the coeffcient of r2 , S( J - 1) , under moderate cluster sizes. Hence, the between-subject correlation r2 has a greater impact on sample size than within-subject correlation r1 . Figure 5.2 graphically explores the relationship between the required number of clusters (n) and correlation coeffcients r1 and r2 . Specifcally, we see that for a given r1 , a change of r2 from 0 to 0.2 leads to large jumps in n. On the other hand, for a given r2 , a change of the same magnitude in r1 only leads to marginal changes in n. Finally, as cluster size J ® ¥ , it can be shown that sample size (5.10) approaches 3 ( z1-a /2 + z1-g ) S2s 2 r2 2

(

z 02 S2 - 1

)

.

This lower limit for the number of clusters as J ® ¥ suggests that, in practice, increasing enrollment within existing clusters (increasing J) cannot always compensate for the lack of unique clusters. In the above, we have presented sample size calculation for closedcohort SW-CRTs. The introduction of within-subject correlation W and between-subject correlation F also makes it convenient to accommodate cross-sectional SW trials. Under the cross-sectional SW design, the observations obtained at each step are contributed by different panels of subjects enrolled from the same clusters. If we follow the idea of a linear mixed-effect model and assume that observations from the same cluster share a cluster

FIGURE 5.2 Relationship between required number of clusters (n) and correlation coeffcients r1 and r2 . The other parameters are a = 0.05 , g = 0.2 , z 0 = 0.2 , s 2 = 1 , J = 20 , and T = 5 .

132

Design and Analysis of Pragmatic Trials

random effect, the resulting correlation structure would be characterized by W = 11’ r + (1 - r )I and F = 11’ r , as was considered by [8]. Here r is the intracluster correlation coeffcient (ICC). The required number of clusters is calculated by

( z1-a /2 + z1-g ) n=

2

2 T ì é T ü ù ï ï s ps í J r ê (vst - ut )ú + (1 - r ) (vst - ut )2 ý úû ïî êë t=1 ïþ s=1 t=1 . 2 é T ù 2 z 0 J ê ut ( 1 - ut ) ú êë t=1 úû S

2

å

å

å

å

Under the special case of ps =

(5.11)

1 for s = 1, … , S , it can be simplifed as S

3 ( z1-a /2 + z1-g ) Ss 2 [(JS - 2)r + 2] 2

n=

(

z 02 J S2 - 1

)

.

(5.12)

Details of derivation are presented in Appendix C. 5.3.1 Accounting for Missing Data Researchers frequently encounter the problem of missing data when conducting PCTs. For closed-cohort SW trials, this problem is particularly pronounced due to the need for repeated measurements (at multiple steps) from study participants over a prolonged follow-up. Here we demonstrate that the sample size (5.8) can be extended to accommodate missing data while maintaining a closed form. First, we introduce missing data indicators D ijt , which take values 0/1 for missed or observed measurement from the jth subject of the ith cluster at Step t. Under the assumption of missing completely at random (MCAR), An and En can be rewritten as J

n

* n

A =n

-1

ååX diag( D ) X ’ i

ij

i

i=1 j=1

and n

* n

E =n

-1

æ ç ç è

åå i=1

Ä2

ö D ij ˜ eˆ ij ) ÷ , X (D ÷ j=1 ø J

’ i

where D ij = ( D ij1 ,…, D ijT ) , diag ( c ) is a diagonal matrix with vector c being the diagonal elements, and ˜ indicates the operation of Hadamard product. Specifcally, for two matrices M 1 and M 2 of the same dimension, the (i , j) th component of M 1 ˜ M 2 is the product of the (i , j) th component of M 1 and ’

133

The GEE Approach for Stepped-Wedge Trial Design

the (i , j) th component of M 2 . We assume that P ( D ijt = 1) = d t . That is, we assume that the probability of an observation being missing only depends on its time of measurement. We further defne the joint probability of observing measurements at both Steps t and t¢ as P ( D ijt D ijt¢ = 1) = d tt¢ . Usually the occurrence of missing data increases over time, hence we assume d 1 ³ ˜ ³ d T . Given the same set of marginal missing probabilities, there can be different missing data patterns. For example, missed visits cause missing values to occur independently over time, which is called the independent missing (IM) pattern with d tt¢ = d td t¢ for t ¹ t¢ , and d tt = d t . On the other hand, patient dropout leads to a pattern such that once a patient misses a measurement, all his/her subsequent measurements are missing. This is called the monotone missing (MM) pattern with d tt¢ = d t¢ for t £ t¢ and d tt = d t . As n ® ¥ , An* ® A* and En* ® E * , where S

A* = J

åp W diag(d ) W s

’ s

s

s=1

and S

E* = s 2 J

åp W éëd˜ ° W + ( J - 1)diag(d ) FFdiag(d )ùû W . ¢ s

s

s

s=1

~

Here d = (d 1 ,…, d T ) , and d is a matrix with the ( t , t ) th element being d t and the ( t , t’ )th ( t ¹ t¢ ) element being d tt’ . Based on a derivation similar to that in Appendix A, it can be shown that the general sample size formula which accommodates missing data also has a closed form: ’

( z1-a /2 + z1-g ) n=

2

S

s2

åp ( v - u ) éêëd ˜ W + ( J - 1)diag(d ) FFdiag(d )úùû ( v - u ) ’

s

~

s

s

s=1

é z 02 J ê êë

ù d tut ( 1 - ut ) ú úû t=1 T

å

2

.

(5.13) In practice, researchers sometimes would like to evaluate testing power given a sample size n, which can be calculated by æ ç ç Pç Z < ç s çç è

ö ÷ ÷ t=1 - z1-a /2 ÷ , S ÷ ’ ps ( vs - u ) éëd˜ ° W + ( J - 1)diag (d ) F ag (d ) ùû ( vs - u ) Fdia ÷÷ s=1 ø T

å

n|z 0 |

å

d tut ( 1 - ut )

134

Design and Analysis of Pragmatic Trials

where Z is a random variable following the standard normal distribution. Finally, for a closed-cohort SW trial, the total number of subjects is jointly determined by the number of clusters (n) and the number of subjects per cluster (J). Given n, we can calculate the cluster size by S

åp ( v - u ) d˜ ° W ( v - u ) - C ’

s

J=

å

S

s=1

é nz 02 ê êë

s

s

2

ù -2 ut ( 1 - ut ) ú éë( z1-a //2 + z1-g )s ùû - C úû t=1 T

å

,

ps ( vs - u ) éë diag (d ) F Fdiag (d ) ùû ( vs - u ) . The above sample size formulas are obtained under the missing completely at random (MCAR) assumption [30]. Adopting this assumption leads to simplifed mathematical derivation and helps provide a universally applicable sample size solution. In practice, this MCAR assumption might not hold. In such cases, a separate model to account for the missing data mechanism needs to be constructed. This additional model increases the complexity of mathematical derivation and makes a closed-form sample size formula unlikely. Furthermore, because each trial has a unique missing data mechanism, it would be impossible to provide a general sample size solution. In these scenarios, power analysis is usually performed based on simulation studies. In practice, the missing data pattern might not be exactly independent (IM) or monotone (MM). It is more likely that different types of missing patterns are involved in a pragmatic trial. For example, it is likely that some measurements are missing due to missed appointments by accident (IM), while some are missing due to subjects dropout in the middle of the study (MM). We can accommodate such scenarios by specifying d t and d tt¢ as a mixture of IM and MM. For example, we can specify where C =



s=1

d t(MIX ) = wd t(IM) + (1 - w)d t(MM) and ) d tt(MIX = wd tt(IM) + (1 - w)d tt(MM) , ¢ ¢ ¢

where w is the anticipated proportion of missing values due to IM. 5.3.2 Simulation Research The derivation of closed-form sample size formulas in Sections 5.3 and 5.3.1 utilizes approximations, that is, by replacing sample estimators An* and En* with their limits A* and E * . Furthermore, the validity of inference by the

135

The GEE Approach for Stepped-Wedge Trial Design

GEE approach depends on the large sample theory. It is important to conduct simulation research to assess whether sample sizes calculated based on the GEE approach actually achieve desired operational properties (i.e., power and type I error), particularly in scenarios where sample sizes are moderate or small (or where the large sample theory might not hold). Simulation research is conducted to assess performance under a set of scenarios. Researchers usually specify a range of plausible values for each design parameter. Then combinations of design parameters form the set of scenarios that will be explored. For example, Table 5.1 presents simulation results for closed-cohort SW trials with a moderate cluster size J = 20 or a larger cluster size J = 40. The clusters are assumed to switch from control to intervention over T = 5 steps following a balanced randomization scheme. Hence, TABLE 5.1 Simulation results: required number of clusters (empirical power, empirical type I error) for closed-cohort SW trials Missing J 40

Pattern IM

GEE

d

r1 = 0.15

r1 = 0.30

r1 = 0.15

d1

28 (0.8136, 0.0764) 30 (0.8039, 0.0763) 29 (0.8092, 0.0812) 28 (0.8043, 0.0778) 31 (0.8201, 0.0738) 29 (0.8077, 0.0770) 28 (0.8120, 0.0710) 36 (0.8107, 0.0642) 41 (0.8043, 0.0668) 39 (0.8096, 0.0707) 37 (0.8017, 0.0730) 42 (0.8080, 0.0637) 40 (0.8160, 0.0688) 37 (0.8117, 0.0688)

29 (0.8146, 0.0789) 32 (0.8176, 0.0746) 30 (0.8099, 0.0716) 30 (0.8159, 0.0754) 32 (0.8123, 0.0732) 31 (0.8136, 0.0708) 30 (0.8275, 0.0788) 39 (0.8110, 0.0664) 44 (0.8064, 0.0634) 42 (0.8178, 0.0687) 40 (0.8078, 0.0605) 45 (0.8141, 0.0666) 42 (0.8111, 0.0664) 40 (0.8117, 0.0690)

30 (0.8024, 0.0532) 32 (0.7882, 0.0509) 31 (0.7944, 0.0499) 30 (0.7922, 0.0551) 33 (0.8079, 0.0516) 31 (0.7926, 0.0576) 30 (0.7918, 0.0556) 38 (0.7929, 0.0512) 43 (0.7887, 0.0516) 41 (0.7930, 0.0507) 39 (0.7971, 0.0503) 44 (0.7922, 0.0485) 42 (0.8079, 0.0533) 39 (0.7855, 0.0546)

d2 d3 d4 MM

d2 d3 d4

20

IM

d1 d2 d3 d4

MM

Adjusted GEE

d2 d3 d4

r1 = 0.30 31 (0.7930, 0.0539) 34 (0.7992, 0.0492) 32 (0.7951, 0.0508) 32 (0.8006, 0.0497) 34 (0.7932, 0.0528) 33 (0.8033, 0.0542) 32 (0.7975, 0.0530) 41 (0.7977, 0.0512) 46 (0.8062, 0.0492) 44 (0.7956, 0.0485) 42 (0.7996, 0.0517) 47 (0.7963, 0.0548) 44 (0.8038, 0.0544) 42 (0.7966, 0.0511)

136

Design and Analysis of Pragmatic Trials

p1 = ˜ = pS = 1/ 4 with S = T - 1 = 4 . The true intervention effect is set at z 0 = 0.2 and residual variance s 2 = 1 . The within-subject correlation matrix W is assumed to have a CS structure with off-diagonal elements rtt¢ = r1 ( t ¹ t¢ ) and the between-subject correlation matrix is specifed as F = 11¢ r2 . Within-subject correlation is usually strong, hence we explore two values for r1 : 0.15 and 0.3. For between-subject correlation, we assume r2 = 0.03 . Table 5.1 also investigates different missing patterns, which include independent missing (IM) and monotone missing (MM). For the vector of marginal observational probabilities d = (d 1 ,…,d T ) , four values are explored:

d1 d2 d3 d4

= = = =

(1.000,1.000,1.000,1.000,1.000 ) , (1.000,0.790,0.760,0.7730,0.700 ) , (1.000,0.925,0.850,0.775,0.700 ) , (1.000,1..000,1.000,0.800,0.700 ) .

Here d 1 represents the scenario of no missing data, while d 2 , d 3 , and d 4 represent scenarios where an increasing number of subjects miss visits over time with a common dropout rate of 0.3 at the end of study but following different temporal patterns. Under d 2 the observational probabilities drop quickly initially but plateau afterwards; under d 3 the observational probabilities drop following a linear trend; and under d 4 the observational probabilities remain high initially but drop quickly afterwards. The goal is to test the null hypothesis H 0 : z = 0 with power 1 - g = 0.8 at a two-sided type I error rate a = 0.05 . For each combination of design parameters, the performance of the sample size solution is evaluated based on the following algorithm: 1. Calculate the required number of clusters (n) for a given set of design parameters. 2. Generate a set of observations {Yi , i = 1, ˜ , n} based on simulation parameters under the null hypothesis H 0 : z = 0 or the alternative hypothesis H1 : z = z 0 . Yi’s are generated from a multivariate normal distribution with mean vector and covariance matrix implied by the design parameters. 3. Generate missing indicators under different missing patterns and probabilities s . ^

4. Calculate b and å n . The test statistic is Z = n |zˆ |/sˆ z . If Z > z1-a/2 , then set rejection indicator D = 1 ; otherwise D = 0 . 5. Repeat Steps 2–4 for L = 10000 times to obtain Dl for l = 1, … , L . 6. The empirical type I error and empirical powers are estimated by

å

L

Dl / L under the null and alternative hypothesis, respectively.

l=1

The GEE Approach for Stepped-Wedge Trial Design

137

Table 5.1 presents the calculated number of clusters as well as their empirical powers and type I errors under different design confgurations. We have several observations. First, it is straightforward that larger cluster sizes (J) lead to smaller numbers of clusters. Second, trials with greater dropout initially ( d 2 ) require larger sample sizes to compensate for information loss compared with other missing scenarios ( d 3 and d 4 ), although their fnal dropout rates are the same. Furthermore, under the MM missing pattern, missing data tend to concentrate on a few subjects, which leads to greater information loss and larger sample size requirement. We also observe that the existence of various types of correlations mitigates the impact of missing data in SW trials. Specifcally, with a dropout rate of 30% at the end of the study, the increase in sample size that is needed to compensate for missing data is at most 16.7% over all the settings considered. 5.3.3 Adjusting for Underestimated Variances for Small Sample Sizes There is one more issue that needs to be addressed. It has been widely reported that when the number of clusters is relatively small, the variance of treatment effect can be underestimated based on GEE, even if the total number of unique patients is large. This underestimated variance can result in an infated type I error. To address this problem of estimation bias under small numbers of clusters, researchers have proposed many methods, such as making correction on the GEE variance estimator [31–34], making inferences based on t-distribution or F-distribution, or combining different adjustment methods [35–38]. In the frst two columns of Table 5.1, we show results based on GEE estimators with no correction. We observe that the empirical powers and type I errors are both infated (greater than their nominal levels 0.8 and 0.05, respectively), a typical symptom of underestimated variance. After exploring various adjustment methods, we fnd that the combination of methods by Morel, Bokossa, and Neerchal [34] (MBN) and Donner and Klar [39] achieves a good balance that is easy to implement while properly maintaining the power and type I error at their nominal levels. Specifcally, Morel, Bokossa, and Neerchal [34] proposed to add a correction term to the GEE covariance estimator. Given our model setup, the adjusted variance estimator is å †n = å *n +Q n , where å *n = ( An* )-1 En* ( An* )-1 is the unadjusted general variance estimator that accounts for missing data, and the correction term is 1 T +1 ü ü ì ì Qn = min í0.5, trace[( An* )-1 En* ]ý ×(( An* )-1. ý × max í1, + n T 1 T 1 þ þ î î

138

Design and Analysis of Pragmatic Trials

It is noteworthy that trace[( An* )-1 En* ] equals the sum of eigenvalues of matrix ( An* )-1 En* , which has been referred to as the “generalized design effect” in [40]. It is easy to show that Qn approaches to zero as the number of clusters (n) increases. Donner and Klar [39] suggested adding one cluster per arm when the sample size is determined under 0.05 type I error. We present the simulation results based on the adjusted GEE in the last two columns of Table 5.1, which demonstrate that after correction, the empirical powers and type I errors are better maintained at their nominal levels of 0.8 and 0.05, respectively. 5.3.4 Consideration of Efficiency and Robustness The GEE sample size method offers fexibility to account for missing data and different correlation structures. As a GEE-based approach, it is also robust against deviation from distributional assumptions. To facilitate the derivation of closed-form sample size formulas, however, we have employed independent working correlation. Although statistical inference remains consistent, the estimated treatment effect can be less effcient than the one obtained when the working correlation is correctly specifed. Within the context of closed-cohort SW trials, we compare the GEE sample size method with that by Hooper et al. [41], which has been widely used. Hooper’s approach is developed based on linear mixed-effect models, Yijt = l t + uitz + ti + q ij + hit + eijt ,

(5.14)

where Yijt is the continuous response measured at Step t ( t = 1, ˜ , T ) from Subject j ( j = 1, ˜ , J ) of Cluster i ( i = 1, ˜ , n ). On the right-hand side, lt denote step-specifc intercepts; uit indicates the treatment received by Cluster i at Step t, with 0/1 denoting control/intervention; z is the intervention effect; t i ~ N (0, s t2 ) is the random effect of cluster; q ij ~ N (0, s q2 ) is the random effect of Subject j from Cluster i; hit ~ N (0, s h2 ) is the random effect of cluster-step interaction; and eijt ~ N (0, s e2 ) is the residual error term. To facilitate comparison with the GEE approach, we introduce the following correlation parameters:

r1 =

s t2 + s q2 , s2

r2 =

s t2 + s h2 , s2

r3 =

s t2 . s2

139

The GEE Approach for Stepped-Wedge Trial Design

Here s 2 = s t2 + s q2 + s h2 , r1 is the within-subject cross-step (longitudinal) correlation, r2 is the within-subject within-step (ICC) correlation, and r3 is the cross-subject cross-step correlation. Conversely, we have

s t2 = s 2 r3 , s q2 = s 2 ( r1 - r3 ), s h2 = s 2 ( r2 - r3 ), s e2 = s 2 (1 - r1 - r2 + r3 ). Using notations of the GEE sample size method, Model (5.14) leads to a correlation structure such that both the within-subject correlation W =Corr(Yij ) and between-subject correlation F =Corr(Yij , Yij¢ ) are of the CS structure. For W , the diagonal elements are 1 and off-diagonal elements are r1 , while for F the diagonal elements are r2 and off-diagonal elements are r3 . The overall correlation matrix is then defned according to Equation (5.6). Note that when setting s h2 = 0 , we have r2 = r3 and F = 11¢ r2 , and the corresponding GEE sample size formula is (5.10). Table 5.2 shows simulation results of the GEE sample size method and Hooper’s method (denoted by LMM) where data are generated under the normality assumption. It assumes the closed-cohort SW trial to have T = 5 steps with balanced randomization of S = 4 unique treatment sequences. For W , r1 = 0.1 and 0.2 are considered. For F F, it is assumed that F = 11¢ r2 and two values are considered for r 2 : 0.03 and 0.05. The cluster size is assumed to be moderate ( J = 20 ) or large ( J = 40 ). The true intervention effect is set at z 0 = 0.2 and marginal variance s 2 = 1 . The nominal levels of power and two-sided type I error are set at 0.8 and 0.05, respectively. Because Hooper’s TABLE 5.2 Required number of clusters (empirical power, empirical type I error) under the normality assumption J

r1 / r2

GEE method

LMM method

40

0.1/0.03 0.2/0.03 0.1/0.05 0.2/0.05 0.1/0.03 0.2/0.03 0.1/0.05 0.2/0.05

36 (0.7812, 0.0455) 37 (0.7876, 0.0476) 47 (0.7865, 0.0437) 48 (0.7921, 0.0486) 51 (0.7897, 0.0484) 53 (0.7828, 0.0520) 61 (0.7913, 0.0502) 64 (0.7934, 0.0505)

32 (0.7874, 0.0517) 32 (0.7936, 0.0536) 44 (0.7947, 0.0481) 44 (0.7976, 0.0535) 47 (0.7969, 0.0520) 46 (0.7891, 0.0546) 57 (0.7949, 0.0564) 58 (0.7974, 0.0532)

20

140

Design and Analysis of Pragmatic Trials

method does not account for missing data, only the scenario of complete observations is considered. Table 5.2 suggests that when the normality assumption holds (when random effects are indeed normally distributed), the GEE sample size method is less effcient (by 6% to 13%) than Hooper’s, although both methods maintain empirical powers and type I errors close to their nominal levels. It is noteworthy that under Hooper’s method, the treatment effect estimator based on linear mixed-effect models assumes the true correlation structure to be known. The GEE estimator of treatment effect, however, assumes the true correlation structure to be unknown and an independent working correlation is employed. Next, we evaluate performance of the two methods when the normality assumption is violated. The design parameters (means, variances, and correlation coeffcients) are assumed to be of the same values. The random effects ( ti ) and residual errors ( eijt ), however, are generated from shifted gamma (SG) distributions [42, 43]. We set the shape parameter to be 0.5, the scale parameter 2v , and the shifted location parameter v / 2 . Under the above specifcation, an SG random variable has mean 0 and variance v. Hence, different values of v can be specifed for t i and eijt so that their variances match those specifed under the normal model. The simulation results are presented in Table 5.3. First, we observe that the sample sizes remain the same as those in Table 5.2, because the sample size formulas do not explicitly account for distribution. As for empirical powers and type I errors, the GEE method still maintains them close to the nominal levels (0.8 and 0.05), but we observe large deviation under Hooper’s method. Specifcally, both empirical powers and type I errors are severely infated. This fnding is consistent with the theoretical conclusion that GEE-based methods only require correct specifcation of the frst two moments and do not require additional distributional assumptions. On the other hand, for sample size methods developed based on mixed-effect models, violation of model assumptions can adversely impact the ability of designed trials to achieve desired operational characteristics (power and type I error). In Figure 5.3 we present the density functions of the standard normal distribution and an SG distribution with a shape parameter of 0.5, scale parameter of 2 , and shifted location parameter of 1/ 2 . Both distributions have mean 0 and variance 1, but the shapes of their density functions are drastically different. The SG distribution is specifed to represent extreme deviation from normality. The GEE sample size method successfully maintaining the empirical powers and type I errors at their nominal levels is a testimony to its robustness against deviation from distributional assumptions. In summary, the GEE method provides a conservative sample size solution under ideal scenarios (where normality holds and the true correlation structures are known). However, it is advantageous in the following aspects: (a) not requiring the correlation structure to be known during inference; (b) being robust against deviation from distributional assumptions; (c) accounting for missing data and unbalanced randomization schemes; (d) having

141

The GEE Approach for Stepped-Wedge Trial Design

Densities of Normal and SG Distributions 1.4 Normal SG

1.2

Density

1.0 0.8 0.6 0.4 0.2 0.0 −3

−2

−1

0

1

2

3

x FIGURE 5.3 The density functions of the standard normal distribution and a shifted gamma distribution (SG) with shape parameter 0.5, scale parameter 2 , and the shifted location parameter 1/ 2 . Both distributions have mean 0 and variance 1.

closed-form sample size formulas, which is easy to use by practitioners with little statistical background and offers theoretical insights on the impact of various design factors. Hence, the GEE sample size method provides a useful solution to the design of SW trials in pragmatic settings (Table 5.3). TABLE 5.3 Required number of clusters (empirical power, empirical type I error) when the normality assumption is violated J

r1 / r2

GEE method

LMM method

40

0.1/0.03 0.2/0.03 0.1/0.05 0.2/0.05 0.1/0.03 0.2/0.03 0.1/0.05 0.2/0.05

36 (0.7881, 0.0418) 37 (0.7873, 0.0461) 47 (0.7949, 0.0469) 48 (0.7931, 0.0491) 51 (0.7914, 0.0481) 53 (0.7910, 0.0482) 61 (0.7937, 0.0456) 64 (0.7901, 0.0459)

32 (0.8933, 0.1666) 32 (0.9022, 0.1821) 44 (0.9236, 0.2481) 44 (0.9302, 0.2624) 47 (0.8707, 0.1033) 46 (0.8741, 0.1196) 57 (0.8891, 0.1433) 58 (0.9031, 0.1604)

20

142

Design and Analysis of Pragmatic Trials

5.4 Design SW Trials with a Binary Outcome The GEE approach is generally applicable to outcome variables within the exponential family. Wang et al. [44] showed that, following similar derivation as in Section 5.3, a closed-form sample size formula can be obtained for SW trials with binary outcomes. Suppose in a closed-cohort SW trial with T steps, n clusters are randomly assigned to S = T - 1 treatment sequences

å

with probability ps , s = 1, ˜ , S and

S

ps = 1. The number of clusters

s=1

randomized to receive the sth sequence is denoted by ns, with

å

S

ns = n.

s=1

Each cluster includes J subjects. We use Ysijt to denote the observed binary outcome from Subject j (j = 1, ˜ , J ) of Cluster i ( i = 1, ˜ , ns ) under the sth ( s = 1, ˜ , S ) sequence at Step t ( t = 1, ˜ , T ). The frst moment E(Ysijt ) = mst is modeled by æ mst ö log ç ÷ = lt + vstz , è 1 - mst ø where lt is the step-specifc intercept, vst is the treatment indicator with values 0/1 indicating control/intervention, and z quantifes the intervention effect, which is assumed to be constant over time. Defne vs = (vs1 , ˜ , vsT )¢ and {vs , s = 1, ˜ , S} to be the S unique treatment sequences involved in this study. The null hypothesis of interest is H 0 : z = 0. Utilizing the property of a binary outcome, we have Var (Ysijt ) = mst (1 - mst ). We defne Ysij = (Ysij1 , … , YsijT )¢ to be the vector of longitudinal observations from each subject, and Ysi = ( Ysi1’, … , YsiJ’) to be the collection of measurements from the (s, i) th cluster. Then W =Corr(Ysij ) is the within-subject (longitudinal) correlation matrix and F =Corr (Ysij , Ysij¢ ) is the between-subject correlation in Cluster (s, i) , and R = I J Ä ( W - F ) + 1 J 1’J Ä F is the correlation matrix of Ysi . Observations are assumed to be independent across clusters. Hence, the frst two moments of Ysi are fully specifed for the GEE approach. ’ Defne b = ( l1 ,…, lT , z ) to be the vector of regression coeffcients.

(

)

^

Utilizing an independent working correlation, the GEE estimator b can be solved from the equation U ( b ) = 0 using the Newton–Raphson algorithm, where U(b ) =

S

ns

J

åååX éëY ’ s

sij

- ms ( b )ùû

s=1 i=1 j=1

is the score function with ms = ( ms1 , … , msT ) , and Xs = ( IT , vs ) is the design matrix with v s = ( vs1 ,…, vsT ) . As n ® ¥ ,

^

n( b - b ) asymptotically follows

143

The GEE Approach for Stepped-Wedge Trial Design

a multivariate normal distribution with zero mean and covariance matrix å = A-1 EA-1 , where S

A= J

åp ( X G ) ’ s

s

Ä2

s

s=1

and S

E=J

Fùû G X . ë W + ( J - 1) F åp X G éW ’ s

s

s

s

s

s=1

Here Gs is a T ´ T diagonal matrix with the diagonal elements being mst (1 - mst ) for t = 1, ˜ , T . In practice, A and E can be estimated by ˆ= J A n

S

ån ( X Gˆ ) ’ s

s

Ä2

s

s=1

and Eˆ = n

ns

S

-1

æ ç ç è

ö X eˆ ÷ ÷ j=1 ø J

åå å s=1 i=1

Ä2

’ s sij

,

where eˆ sij = Ysij - mˆ s is the individual vector of residual error with ˆ is T ´ T diagonal with elements mˆ (1 - mˆ ) . mˆ s = ( mˆ s1 , … , mˆ sT )¢ , and G s st st Note that ^

mst =

^

exp(l t +vstz ) ^ ^ 1 + exp(l t +vstz ) ^

Let sˆ z2 be the ( T + 1, T + 1) th element of å = A-1 EA -1. Based on the test statistic z = n|zˆ |/sˆ z , given a true intervention effect z 0 , to reject the null hypothesis H 0 : z = 0 with a power of 1 - g at a two-sided signifcance level of a , the required number of clusters can be computed by

( z1-a /2 + z1-g ) n=

2

^

^ ^

S

åp ( v - a ) G [W + (J - 1)FF] G ( v - a ) s

s

s

s=1

é T z Jê ëê t=1 2 0

æ ç ç è

ù ö wst ÷ at ( 1 - at ) ú ÷ úû s s=1 ø S

åå

where wst = ps mst (1 - mst ) and a = ( a1 ,…, aT ) , with

s

2

s

,

(5.15)

144

Design and Analysis of Pragmatic Trials

S

åw v

st st

at =

s=1 S

åw

.

st

s=1

We can interpret at as a weighted proportion of subjects receiving intervention at Step t. Details of derivation are presented in Appendix D. Sample size (5.15) assumes that every subject contributes complete observations over all steps. For closed-cohort SW trials conducted in pragmatic settings, however, the occurrence of missing data in longitudinal follow-up is usually inevitable. Properly accounting for the impact of missing data in sample size calculation is vital to ensure a properly powered SW trial. To address this problem, we introduce a missing indicator D sijt , which takes value 1/0 if measurement Ysijt is observed/missing. We assume that the probability of a measurement being missing only depends on time and defne the marginal observational probability P ( D sijt = 1) = d t . To accommodate different missing data patterns, we also introduce the joint observational probability

(

)

P D sijt D sijt’ = 1 = d tt’ , which is the probability that a subject contributes observations both at Steps t and t¢ ( t ¹ t¢ ). For example, under the independent missing (IM) pattern, the occurrences of missing data are independent between t and t¢ , hence d tt¢ = d td t¢ . Under the monotone missing (MM) pattern, a subject with missing data at t would miss all subsequent observations, hence d tt¢ = d t¢ for t¢ > t. Under the MCAR assumption, as n ® ¥ , the limits of A and E are S

A* = J

åp X diag(d ) G G X s

’ s

s

s

s

s=1

and S

E* = J

åp X G éëdˆ ˜ W + ( J - 1)diag(d ) FFdiag(d )ùû G X , ’ s

s

s

s

s

s=1

respectively. Here ˜ indicates the operation of Hadamard product, diag (d ) ’ is a T ´ T diagonal matrix with diagonal elements being d = (d 1 ,…, d T ) , and dˆ is a T ´ T matrix with the ( t , t ) th diagonal element being dt and the ( t , t’ )th off-diagonal element being dtt’ . Then the generalized formula for the number of clusters that accommodates missing data is

( z1-a/2 + z1-g ) *

n =

2

S

åp ( v - a ) G ëéd˜ ° W + ( J - 1)diag(d ) FFdiag(d )ùû G ( v - a ) s

s

s

s

s=1

é T z 02 J ê êë t=1

æ ç ç è

ù ö wst ÷ dt at ( 1 - at ) ú ÷ úû s=1 ø S

åå

s

2

(5.16)

.

145

The GEE Approach for Stepped-Wedge Trial Design

Sample size (5.16) is fexible to account for various missing data patterns (through d˜ , d ), complicated correlation structures (through W, W F ), and unbalanced randomization (through ps and vs ). On the other hand, given the number of clusters n* and the true treatment effect z 0 , the testing power that can be achieved is æ ç ç PçZ < ç ç ç è

ö ÷ ÷ t=1 - z1-a /2 ÷ , S ÷ ps ( vs - a ) Gs éëd˜ °W + ( J - 1) diag (d ) Fdiag (d ) ùû Gs ( vs - a ) ÷ ÷ s=1 ø T

n* Jz 02

æ ç ç è

ö wst ÷ d t at ( 1 - at ) ÷ s=1 ø S

åå

å

where Z is a standard normal variable. The above sample size formulas are developed in the context of closedcohort SW trials. As was shown in Section 5.3, with different specifcations of correlation matrices W and F F, the same formulas can be employed to assess sample size requirement for cross-sectional SW trials. 5.4.1 Extension to Outcomes from the Exponential Family The GEE sample size calculation approach for SW trials is generally applicable to outcomes from the exponential family. Here we present a general framework. Suppose in a closed-cohort SW trial, we use Yijt to denote the outcome measurement obtained from the jth ( j = 1, ˜ , J ) subjects in ith cluster ( i = 1, ˜ , n ) at Step t ( t = 1, ˜ , T ). Without loss of generality, suppose Yijt follows a distribution from the exponential family. Defne Yij = (Yij1 , ˜ , YijT )¢ . We model the mean E(Yij ) = mi = ( mi1 ,˜, miT )¢ by link function g( mi ) , such that g( mi ) = l + uiz = Xi b . l = (l1 , ˜ , lT )¢ Here accommodates arbitrary temporal trends, ui = (ui1 , ˜ , uiT )¢ denotes the treatment sequence received by Cluster i, z l z )¢ is the vector of regression coefrepresents the treatment effect, b = (l’, fcients, and Xi = (IT , ui ) is the individual design matrix that is shared by all subjects within a cluster. Hence, mi = g -1(Xi b ) . We use {vs , s = 1, ˜ , S} to denote the S unique treatment sequences involved in the SW trial, where vs = (vs1 , ˜ , vsT )¢ and vst = 0 /1 indicate control/intervention at Step t. We assume that P(ui = vs ) = ps for s = 1, ˜ , S , with -1

å

S

ps = 1 . Accordingly, we

s=1

defne ms = ( ms1 , ˜ , msT )¢ = g (Ws b ) , where Ws = ( IT , vs ) . We defne Corr(Yij ) = W and Corr (Yij , Yij¢ ) = F for j ¹ j¢ . Then the correlation matrix of Yi = (Yi1’,˜, YiJ’)¢ is R =Corr (Yi ) = I J Ä (W W-F F) + (1J1 ’J ) Ä F.

146

Design and Analysis of Pragmatic Trials

The variance of Yijt is denoted by Var (Yijt ) = y h( mit ) , where y is the dispersion parameter and function h( mit ) represents the association of mean to variance. Specifc forms of f and h( mit ) depend on the distribution of Yijt . For example, under the normal distribution h( mit ) = 1 and Y equals the residual variance; under the Bernoulli distribution h( mit ) = mit (1 - mit ) and Y = 1 ; and under the Poisson distribution h( mit ) = mit and Y = 1 . Defne the diagonal matrix Bij =diag[ Yh( mi1 ),˜, Yh( miT )]. Based on the independent working correlation structure, the GEE estimator of b is obtained by solving the score function Un ( b ) = 0 , where ì ï ï ï ï ï Un ( b ) = í ï ï ï ï ï î

ü ï ï j=1 ï ˜ ï J ï ý. (YijT - miT ( b )) ï j=1 ï ï éë mi ( b )¢(Yij - mi ( b ))ùû ï ï þ J

n

åå(Y

ij1

i=1

n

- mi1( b ))

åå i=1 J

n

åå i=1 jj=1

Under mild regulatory conditions, as n ® ¥ , n( bˆ - b ) ® N(0, å ) in distribution. The covariance matrix å is consistently estimated by the GEE sandwich variance estimator å n = An-1 En An-1 , where An =

1 n

J

n

ååX’ B X ij

ij

ij

i=1 j=1

and 1 En = n

n

æ ç ç è

Ä2

ö Xi¢eˆ ij ÷ . ÷ j=1 ø J

åå i=1

Here eˆ ij = Yij - mi ( bˆ ) is the vector of estimated residual errors. Let A and E be the limits of An and En as n ® ¥. It is easy to show that S

A= J

åp W¢ B W s

s

s

s

s=1

and S

E=J

åp W¢ B s

s=1

s

1/2 s

[W W + (J - 1)F]Bs1/2Ws ,

147

The GEE Approach for Stepped-Wedge Trial Design

where Bs =diag[ Yh( ms1 ),˜, Yh( msT )]. Then å n converges to å = A-1 EA-1 . Given the true intervention effect z 0 , to reject the null hypothesis H 0 : z = 0 with power 1 - g at a two-sided signifcance level a , the required number of clusters is n=

(z1-a /2 + z1-g )2s z2 . z 02

(5.17)

Here s z2 is the (T + 1, T + 1) th component of å . We account for missing data using indicator D ijt (with value 0/1 indicating missing or observed measurements). Defne individual vectors of missing indicators D ij = (D ij1 ,˜ , D ijT )¢ . Under the MCAR assumption, An and å n can be rewritten as An* =

1 n

J

n

ååX diag(D )B X i¢

ij

ij

i

i=1 j=1

and 1 En* = n * n

*

n

æ ç ç è

Ä2

ö Xi¢ (D D ij ˜ eˆ ij ) ÷ . ÷ j=1 ø J

åå i=1

* n

As n ® ¥ , A ® A and E ® E * , where S

A* = J

åp W diag(d )B W s

s

¢

s

s

s=1

and S

E* = J

åp W ¢ ëéd˜ ° B s

s=1

s

1/2 s

W WBs1/2 + (J - 1)diag (d )Bs1/2FBs1/2 diag(d )ùû Ws .

Let s be the (T + 1, T + 1) component of å * = A*-1 E * A*-1 . The required number of clusters accounting for missing data can be calculated by 2* z

n=

(z1-a /2 + z1-g )2s z2* . z 02

In summary, under the GEE framework, the key task in sample size calculation for SW trials with outcomes from the exponential family is to obtain matrix Bs . All the other design factors, including design matrix and treatment sequences ( Ws ), randomization scheme ( ps ), correlation structures ( W and F ), and missing data ( d and d˜ ) remain unchanged in the sample size formula. For reference, the forms of Bs and relevant terms for different distributions are presented in Table 5.4.

148

Design and Analysis of Pragmatic Trials

TABLE 5.4 Bs and relevant terms for different distributions in exponential family

Distribution

Continuous

Binary

Count

Normal

Bernoulli

Poisson

g( mijt )

mijt

Var (Yijt )

s2

Bs

æ mijt log ç ç è 1 - mijt

log( mijt )

ö ÷÷ ø

mijt (1 - mijt )

s IT

é ms1 (1 - ms1 ) ê ° ê êë 0

2

˜ ˛ ˜

mijt

ù ú ° ú msT (1 - msT )úû 0

é ms1 ê ê ° êë 0

˜ ˛ ˜

0 ù ú ° ú msT úû

5.5 Longitudinal and Crossover Cluster Randomized Trials In SW trials with T steps, we have used vs = (vs1 , ˜ , vsT )¢ , s = 1, ˜ , S to denote the treatment sequences involved in the trial, where vst = 0 /1 indicates that clusters assigned to the sth sequence receive control/intervention at Step t. With stepwise crossover of treatment, an SW trial has S = T - 1 possible sequences (5.5). The randomization scheme is represented by {(vs , ps ), s = 1, ˜, S} , where ps is the probability of a cluster being randomized to sequence vs ,

å

S

ps = 1 . In sample size calculation based on GEE,

s=1

vs and ps are explicitly involved in the sample size formulas, for example, Equation (5.13). It is noteworthy that the mathematical derivation does not depend on the specifc values of {(vs , ps ), s = 1,˜ , S} . For example, with different specifcations of ps ’s, we can use (5.13) to assess sample size requirement for SW trials with arbitrary unbalanced randomization schemes. On the other hand, recognizing that vs ’s can take forms other than that of (5.5), we can generally apply sample size (5.13) to clinical trials that involve clustered randomization and longitudinal measurements, but not restricted to the SW design. The SW design is a special case of multi-period cluster randomized trial designs, which are conducted over a sequence of time periods such that within each period a treatment condition (conventionally “experimental” treatment or control) is applied to clusters of individuals on whom longitudinal measurements of outcomes are obtained. In the following, we present two other types of such designs which do not require a stepwise crossover of treatment from control to intervention. The frst type is called longitudinal cluster randomized trials (LCRT), where subjects are randomized to the intervention or the control arm, and each subject contributes multiple outcome measurements over the study period [45]. There is no planned

The GEE Approach for Stepped-Wedge Trial Design

149

switchover of treatment. A sequence in an LCRT contains only one treatment. The LCRT design has been widely employed in biomedical research. For example, Bosworth et al. [46] conducted an LCRT on blood pressure (BP) control at a Veterans Affairs (VA) Medical Center. Clusters of patients (by primary care providers) were randomly assigned to the control or the intervention arm. Under intervention, providers received computer-based decision support, which aimed to improve guideline-concordant medical therapy. Under control, providers received the standard reminder. The outcome is BP value, which was measured at each primary care visit per VA standard. The second type is called crossover cluster randomized trials (CCRT), where clusters of subjects are randomly assigned to sequences of treatments over the study period [47, 48]. It is similar to SW trials in that the sequences contain treatment crossover, hence each cluster receives more than one treatment, but the crossover does not follow a stepwise pattern. In Figure 5.4, the upper panel illustrates the scheme of the LCRT design. The lower panel shows the scheme of a CCRT design with two treatments. Clusters randomized to the frst sequence receive control in the frst period and then cross over to intervention in the second period, while clusters in the second sequence receive treatments in the opposite order. For example, Anderson et al. [48] conducted a CCRT with two periods to assess the effect of a lying-fat position compared with a sitting-up position on patients with acute stroke. Clusters of patients (by hospitals) were randomly assigned to two sequences. The frst sequence started with the lying-fat position and then switched to the sitting-up position. The second sequence employed the two positions in the reverse order. The outcome is a modifed Rankin scale score, which measures the degree of disability. It was collected from each patient under the two treatments. CCRTs with two periods have been most commonly used. Design and analysis of CCRTs with multiple periods, however, have also been investigated and employed in practice [49–51].

FIGURE 5.4 Diagrams of the LCRT and CCRT with two sequences. (Blank and shaded cells represent control and intervention, respectively.)

150

Design and Analysis of Pragmatic Trials

SW, LCRT, and CCRT can all be viewed as longitudinal studies embedded in cluster randomized trials. Such settings give rise to complicated correlation structures that involve: (1) correlation among repeated measurements from the same subjects (within-subject or longitudinal correlation), (2) correlation between concurrent measurements from different subjects within the same cluster (the conventional intracluster correlation, ICC), and (3) correlation between nonconcurrent measurements from different subjects within the same cluster (cross-subject cross-period correlation). Ignoring those correlations leads to biased estimation and improper trial designs. For trial design, several sample size estimation methods have been developed based on the GLMM. For example, Roy et al. [52] investigated sample size calculation for LCRTs using the generalized least squares estimates. Heo and Leon [53] presented a sample size method using the maximum likelihood estimate to evaluate intervention-by-time interaction in LCRTs. Giraudeau et al. [54] proposed a sample size method for CCRTs with continuous outcomes where they assumed the longitudinal and cross-subject cross-period correlations to be equal. It was developed for CCRTs with two periods under balanced randomization. Under the GEE framework, Liu et al. [55] derived sample size formulas for comparing two means, two slopes, and two proportions in LCRTs, which assume simple correlation structures (proportional shrinkage CS and AR). For pragmatic trials that involve longitudinal measurements, the issue of missing data is almost inevitable and needs to be accounted for in study design and data analysis [56]. Existing sample size methods for LCRT and CCRT, however, have not properly addressed this issue. We demonstrate that the GEE sample size formula (5.13) can be adapted to provide closed-form sample size solution for LCRTs and CCRTs, accounting for pragmatic design issues such as different missing data patterns, unbalanced randomization, arbitrary correlation structures, and randomly varying cluster sizes [57]. We consider a cluster randomized trial that involves T time periods, where n clusters are randomized to treatment sequence vk = (vk 1 , ˜ , vkT )¢ , k = 1, 2. Here vkt = 0 /1 indicates the treatment of control/intervention in Sequence k. Defne Yijt to be the continuous outcome measurement from the jth subject ( j = 1, ˜ , J ) of the ith cluster ( i = 1, ˜ , n ) during the tth period, and the marginal mean of Yij = (Yij1 , ˜ , YijT )¢ is modeled by E ( Yij ) = l + uiz = Xi b ,

(5.18)

where l = ( l1 ,…, lT ) are time-specifc intercepts, representing an arbitrarily shaped baseline trend; ui indicates the treatment sequence randomly assigned to Cluster i, with P(ui = v1 ) = p and P(ui = v2 ) = 1 - p ; and z quantifes the intervention effect. We use Xi = ( IT , ui ) to denote the design matrix shared by all subjects within Cluster i. Here IT is a T ´ T identity matrix.

151

The GEE Approach for Stepped-Wedge Trial Design

(

)



b = l ’ , z denotes the vector of regression coeffcients. The primary interest is to test H 0 : z = 0 vs H1 : z ¹ 0. For the second moment, we assume marginal variance Var (Yijt ) = s 2 , within-subject (longitudinal) correlation matrix W =Corr(Yij ), and between-subject correlation matrix F =Corr(Yij , Yij¢ ) , j ¹ j¢ . Then correlation matrix of the cluster-specifc outcome vector Yi = Yi1’ ,…,YiJ’ is

(

)

R =Corr(Yi ) = I J Ä ( W - F ) + ( 1 J 1’J ) Ä F F, and observations are assumed to be independent across clusters. The above model setup is very similar to that of (5.4). Through different specifcations of vk , however, the GEE sample size method is applicable to both LCRTs and CCRTs. For example, given T = 4 time periods, the LCRT in Figure 5.4 can be represented by v1 = (0, 0, 0, 0)¢ and v2 = (1,1,1,1)¢, whereas the CCRT can be represented by v1 = (0,1, 0,1)¢ and v2 = (1, 0,1, 0)¢ . Following similar derivation as in Section 5.3, we can obtain a general sample size formula that accommodates arbitrarily specifed treatment sequences (v1 , v2 ) , to detect true intervention effect z 0 with power 1 - g at two-sided type I error a, ncomplete

( z1-a /2 + z1-g ) =

2

s 2 (v1 - v2 )¢[W + ( J - 1) F](v1 - v2 ) T

å(v

z p(1 - p) p J[ 2 0

1t

.

(5.19)

2 2

- v2t ) ]

t=1

The subscript “’complete” of sample size (5.19) indicates that it assumes complete observations from every subject. Similarly, we can extend ncomplete to accommodate missing data, n=

( z1-a /2 + z1-g )

2

s 2 (v1 - v2 )¢ éëd˜ ° W + ( J - 1)diag(d )Fdiag( F d )ùû (v1 - v2 ) T

å(v

z p(1 - p) J[ 2 0

1t

,

(5.20)

- v2t ) d t ] 2

2

t=1

where d = (d 1 , ˜ , d T )¢ and d˜ = (d tt¢ )T ´T are defned to refect marginal missing data probabilities and missing data patterns. More details have been provided in Section 5.3.1. The closed form of (5.20) enables researchers to theoretically explore properties generally applicable to both LCRTs and CCRTs. First, as with SW trials, the temporal trend under control ( l ) approximately has no effect on sample size requirement. Second, given all other design parameters, balanced randomization ( p = 0.5 ) minimizes the sample size requirement. In the following, we further explore the properties of LCRTs and CCRTs.

152

Design and Analysis of Pragmatic Trials

5.5.1 Longitudinal cluster randomized trials In LCRTs, there are two treatment arms: control and intervention, where participants are measured repeatedly. Hence, we defne v1 = 0T and v2 = 1T , where 0T and 1T are vectors of length T with all elements being 0 and 1, respectively. Plugging v1 and v2 into Equation (5.20), we can calculate the required number of clusters for LCRTs by

( z1-a /2 + z1-g ) nLCRT =

2

T

s2

T

åå[d w tt¢

tt¢

+ ( J - 1)d td t¢ftt¢ ]

t=1 t¢=1

æ z p(1 - p)J ç ç è 2 0

ö dt ÷ ÷ t=1 ø T

å

,

2

(5.21)

F, where wtt¢ and ftt¢ are the (t , t¢) th elements of correlation matrices W and F respectively. We have several observations about nLCRT . (1) Because d tt¢ ³ 0 , nLCRT increases with any one of the within-subject (longitudinal) correlation coeffcients ( wtt¢ ). (2) Because (J - 1)d td t¢ ³ 0 , nLCRT increases with any one of the between-subject (concurrent or nonconcurrent) correlation coeffcients ( ftt¢ ). (3) Because d tt¢ = d t¢ ( t¢ > t ) under MM is equal or larger than d tt¢ = d td t¢ under IM, given the same marginal observational probabilities d t ( t = 1, ˜ , T ) and non-negative within-subject correlations ( wtt¢ ³ 0 , " t and t¢ ), the MM pattern always leads to an equal or larger nLCRT than the IM pattern. In practice, sometimes researchers design cluster randomized trials with a pre-determined number of clusters. In such cases, the goal is to calculate the required cluster size (J), which can be obtained by T

T

åå (d w tt¢

J=

t=1 t¢=1

nLCRTQ -

tt¢

T

- d td t¢ftt¢ ) ,

T

åå

d td t¢ftt¢¢

t=1 t¢=1

å

2

T z 02 p(1 - p) çæ d t ö÷ t=1 ø è where Q = . 2 ( z1-a /2 + z1-g ) s 2

When there is no missing data, nLCRT can be simplifed to

( z1-a /2 + z1-g ) nLCRTcomplete =

2

T

s2

T

åå[w

tt¢

t=1 t¢=1

z 02 p(1 - p)JT 2

+ ( J - 1)ftt¢ ] .

(5.22)

The proposed sample size formula accommodates arbitrary correlation structures through the specifcations of W and F F. For example, we can assume

153

The GEE Approach for Stepped-Wedge Trial Design

r1 1 ° r1

æ1 ç r1 W=ç ç° çç è r1

r1 ö æ r2 ÷ ç r1 ÷ r3 and F = ç ÷ ç ° ° ÷ çç 1 ÷ø r è 3

˜ ˜ ˛ ˜

r3 r2 ° r3

˜ ˜ ˛ ˜

r3 ö ÷ r3 ÷ . °÷ ÷ r2 ø÷

This specifcation dictates the CS structure for W and F F. Here r1 represents the longitudinal (within-subject) correlation and r2 is the ICC (cross-subject within-period correlation). They are assumed to be constant over time. The cross-subject cross-period correlation ( r3 ) is assumed to be exchangeable within a cluster. Then (22) can be simplifed to

( z1-a /2 + z1-g )

* nLCRTcomplete =

2

s 2 {1 + (T - 1)r1 + (J - 1)[ r2 + (T - 1)r3 ]} , (5.23) z 02 p(1 - p)JT

which is identical to the sample size formula (Equation (5.27)) developed by * Liu et al. [55]. We also have a few observations about nLCRTcomplete : • As cluster size J ® ¥ ,

* ® nLCRTcomplete

( z1-a /2 + z1-g )

2

s 2 [ r2 + (T - 1)r3 ] , z p(1 - p)T 2 0

which does not approach zero. This is true even if we infnitely increase the number of longitudinal measurements (i.e., T ® ¥ ), where * ® nLCRTcomplete

( z1-a /2 + z1-g )

2

s 2 r3 . z p(1 - p) 2 0

That is, in LCRTs, there is a limit that increasing the cluster size or increasing the number of longitudinal measurements can decrease the required number of clusters (or increase power). In other words, in the design of LCRTs, researchers might not be able to compensate for the lack of independent clusters by enlarging the clusters or obtaining more measurements. • When T = 1 , which corresponds to a traditional cluster randomized trial,

* LCRTcomplete

n

( z1-a /2 + z1-g ) =

2

s 2 [1 + (J - 1)r2 ] . z 02 p(1 - p)J

The required number of clusters only depends on ICC ( r2 ).

154

Design and Analysis of Pragmatic Trials

• When J = 1 , which corresponds to an individual randomized trial with longitudinal measurements,

* LCRTcomplete

n

( z1-a /2 + z1-g ) =

2

s 2 [1 + (T - 1)r1 ] . z p(1 - p)T 2 0

* nLCRTcomplete is the required number of unique subjects and it only depends on the longitudinal correlation ( r1 ).

• When both T = 1 and J = 1 , which corresponds to the individual randomized study,

* LCRTcomplete

n

( z1-a /2 + z1-g ) =

2

s2

z p(1 - p) 2 0

.

The last three formulas are identical to the popular sample size calculation methods used in practice [58]. 5.5.2 Crossover Cluster Randomized Trials In CCRTs, clusters are randomly assigned to two treatment sequences, each consisting of both control and intervention, but in opposite orders. Sample size (5.20) can be adapted to accommodate CCRTs through the specifcation of ( v1 , v2 ). For example, we can specify v1 = (0,1, 0,1,…) and v2 = (1, 0,1, 0,…) for CCRTs with multiple periods, as described in Parienti and Kuss [50]. Then the required number of clusters is nCCRT =

( z1-a /2 + z1-g )

2

s 2 (v1 - v2 )¢ ëéd˜ ° W + ( J - 1) diag(d )F Fdiag(d )ûù (v1 - v2 ) æ z 02 p(1 - p)J ç ç è

ö dt ÷ ÷ t=1 ø T

å

2

.

(5.24) For a CCRT with complete observations, Equation (5.24) is simplifed to nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 (v1 - v2 )¢ éëW + ( J - 1) F Fùû (v1 - v2 ) z 02 p(1 - p)JT 2

.

(5.25)

Here W and F can be specifed with arbitrary structures. Suppose they have the CS structures with correlation parameters ( r1 , r2 , r3 ) as described in Section 5.5.1, when T is an even number, Equation (5.25) can be rewritten as

155

The GEE Approach for Stepped-Wedge Trial Design

* nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 [1 - r1 + (J - 1)( r2 - r3 )] . ) z 02 p(1 - p)JT

(5.26)

When T is an odd number, we have * nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 {T - (T - 1)r1 + (J - 1)[T r2 - (T - 1)r3 ]} . z 02 p(1 - p)JT 2

In practice, CCRTs with two periods are most common, where treatment crossover only occurs once and a single measurement is obtained under each treatment. In such cases, we have T = 2 , v1 = (0,1) , v2 = (1, 0) , and the correlation matrices are reduced to æ1 W=ç è r1

r1 ö æ r2 ÷ and F = ç 1ø è r3

r3 ö ÷. r2 ø

Thus, * nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 [1 - r1 + (J - 1)( r2 - r3 )] . 2z 02 p(1 - p) pJ

(5.27)

It is straightforward that the sample size requirement for CCRTs decreases with greater longitudinal correlation ( r1 ) and cross-subject cross-period correlation ( r3 ), while increases with greater ICC ( r2 ). There are several special adaptations of the above formula: • If p = 0.5 and r1 = r3 , that is, a balanced randomization with equal longitudinal correlation and cross-subject cross-period correlation, * then nCCRTcomplete is identical to the method proposed by Giraudeau et al. [54]. Furthermore, if r2 = r3 , which implies a common betweensubject correlation regardless of whether the measurements are concurrent or not, we have

* nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 (1 - r1 ) . 2z p(1 - p)J 2 0

This formula is the same as the one developed by Vierron and Giraudeau [59]. • If J = 1 , it corresponds to an individual randomized crossover study,

* nCCRTcomplete =

( z1-a /2 + z1-g )

2

s 2 (1 - r1 ) , 2z p(1 - p) 2 0

156

Design and Analysis of Pragmatic Trials

which only depends on the longitudinal correlation ( r1 ). • There are scenarios where multiple measurements (say g) might be obtained under each treatment. We can specify T * = 2g and set v11 = ˜ = v1g = 0 , v1( g +1) = ˜ = v1(2 g ) = 1 , v21 = ˜ = v2 g = 1 , and v2( g +1) = ˜ = v2(2 g ) = 0 . That is, v1 = (0’g , 1’g )¢ and v2 = (1’g , 0’g )¢ . Then sample size requirement can be calculated using the general formula (5.25). 5.5.3 Comparison of nLCRT and nCCRT When calculating sample sizes for LCRTs and CCRTs, the difference lies in the specifcation of v1 and v2 . Utilizing the closed form of sample size formula (5.20), we have developed the following theorem regarding nLCRT and nCCRT . Theorem 1 Given the same design confgurations, under non-negative withinand between-subject correlation coeffcients ( wtt¢ ³ 0 and ftt¢ ³ 0 " t and t¢ ), the LCRT design always leads to a larger sample size requirement than the CCRT design. i.e., nLCRT > nCCRT . The proof is presented in Appendix E. In real clinical trials, within- and between-subject correlations are most likely non-negative. Theorem 1 suggests that a CCRT is always more effcient than an LCRT. Specifcally, under no missing data and the CS correlation structure, the relative effciency between a CCRT and an LCRT is

* * LCRTcomplete CCRTcomplete * CCRTcomplete

-n

n

n

ì (T 2 - 1)[ r1 + (J - 1))r3 ] ï ïT - (T - 1)r1 + (J - 1)[T r2 - (T - 1)r3 ] =í T[ r1 + (J - 1)r3 ] ï ïî 1 - r1 + (J - 1)( r2 - r3 )

when T is odd; when T is even.

(5.28) It is straightforward from (5.28) that the relative effciency increases with longitudinal correlation ( r1 ) and cross-subject cross-period correlation ( r3 ), but decreases with ICC ( r2 ). In summary, for LCRTs, larger r1 , r2 , and r3 are associated with a larger required number of clusters. For CCRTs, however, smaller within-subject (longitudinal) correlation ( r1 ) and cross-subject cross-period correlation ( r3 ), as well as larger ICC ( r2 ), are associated with a larger required number of clusters. Furthermore, for CCRTs with even periods, Equation (5.26) shows that ICC and cross-subject cross-period correlation affect the required number of clusters through their difference, ( r2 - r3 ). Finally, with all other design parameters fxed, it can be shown that the difference in the required number of clusters between LCRTs and CCRTs is independent of ICC ( r2 ):

157

The GEE Approach for Stepped-Wedge Trial Design

* * nLCRTcomplete - nCCRTcomplete

ì ( z1-a/2 + z1-g )2 s 2 [(T 2 - 1)r1 + (J - 1)(T 2 - 1)r3 ] ï z 02 p(1 - p)JT 2 ï =í 2 ï ( z1-a/2 + z1-g ) s 2 [r1 + (J - 1)r3 ] ï z 02 p(1 - p)J î

when T is odd; when T is even.

In Section 5.5.1, we show that for LCRTs the MM missing pattern leads to a larger or equal sample size requirement than IM. For CCRTs, however, with an additional condition of within-subject correlation structure being CS, we can show that the relationship between sample size and missing patterns is in the opposite direction. Theorem 2 Under non-negative within-subject correlation with the CS structure, the MM pattern leads to a smaller or equal sample size requirement than the IM pattern for CCRTs. The proof is presented in Appendix F. Theorem 2 suggests that the impact of missing data on sample size requirement is not straightforward, which depends on correlation structures and treatment sequences. Closed-form sample size formulas are advantageous in providing analytical conclusions and clear guidelines to practitioners. 5.5.4 Adjusting for Small Numbers of Clusters by the t -Distribution In data analysis for cluster randomized trials based on GEE, the variance of treatment effect tends to be underestimated when the number of clusters is limited. This problem can be more severe in LCRTs and CCRTs because besides within-cluster dependence, there is another layer of withinsubject dependence among longitudinal measurements. For example, with a cluster size of J = 30 and T = 3 periods, the cluster-specifc correlation matrix has a dimension of 90 × 90. In Section 5.3.3, we have presented the adjustment approach that involves a correction term to the GEE variance estimator combined with adding one cluster to each arm [34, 39]. Here we present another adjustment approach based on the t-distribution, which is easy to implement and empirically has been demonstrated to perform well in maintaining power and type I error. Specifcally, let zˆ and sˆ T2 +1 / n be the GEE estimator of treatment effect and its variance, respectively. Then in hypothesis testing, instead of a normal threshold of z1-a /2 , the test statistics Z = n |zˆ |/sˆ T +1 is compared with rejection threshold t1-a /2,n -T -1 , which is the 100(1 - a / 2) th percentile of the t-distribution with (n - T - 1) degree of freedom. This canonical approach has been investigated by other researchers [60, 32]. In Table 5.5, we present the simulation results for CCRTs with and without adjustment by the t-distribution. The simulation settings are as

158

Design and Analysis of Pragmatic Trials

follows: We assume T = 3 , s 2 = 0 , p = 0.5 , v1 = (0,1, 0)¢ , and v2 = (1, 0,1)¢ . We set true treatment effect z 0 = 0.12 and consider two cluster sizes, J=15 or 30. The nominal power and type I error are 0.8 and 0.05. The longitudinal correlation W is assumed to have the CS structure with wtt¢ = r1 ( t ¹ t¢ ). For between-subject correlation, we specify F = 11¢ r2 + I( r3 - r2 ) with r3 = r2 / 2 . Four combinations of P = { r1 , r2 , r3 } are explored: = = = =

P1 P2 P3 P4

{0.30,0.05,0.025} ; {0.30,0.03,0.015} ; 2 }; {0.15,0.05,0.025 {0.15,0.03,0.015} .

For missing data, both the IM and MM missing data patterns are considered, with four sets of marginal observational probabilities:

d1 d2 d3 d4

= = = =

(1.00,1.00,1.00 ) ; (1.00,1.00,0.70 ) ; (1.00,0.85,0.70 ) ; (1.00,0.75,0.70 ) .

In Table 5.5, without adjustment, the empirical p-values are generally infated, which is a typical symptom of underestimated variance. After adjustment by the t-distribution, both empirical p-values and powers are closer to their nominal levels. 5.5.5 Accounting for Randomly Varying Cluster Sizes In the above, we have assumed an equal number of subjects across all clusters. That is, we have assumed a constant cluster size J. In practice, however, clusters are often formed naturally and their sizes can vary randomly [55, 53] due to time, logistics, and budget constraints, which constitutes an additional source of variability. A common practice has been to use the average cluster size to estimate the sample size. The resulting experimental design ignores a source of random variability (in cluster size), which might lead to underpowered trials. Sample size (5.20) can be further extended to account for random variability in cluster size. Following the approach developed by [61], we use J i to denote the cluster size of the ith ( i = 1, ˜ , n ) cluster and assume J i ’s to follow a discrete distribution with mean m J and variance s J2 . Based on Model (5.18), with varying cluster sizes J i , the GEE estimator is æ bˆ = ç ç è

n

ö X ¢i Xi ÷ ÷ j=1 ø Ji

åå i=1

-1

æ ç ç è

n

ö

Ji

ååX¢ Y ÷÷ . i ij

i=1 j=1

ø

159

The GEE Approach for Stepped-Wedge Trial Design

TABLE 5.5 Required number of clusters (empirical power, empirical type I error) in CCRTs, with and without adjustment by t-distribution

J Original

30

Missing Pattern IM

MM

15

IM

MM

Adjusted

30

IM

MM

d

P1

P2

d1

43 (0.8072, 0.0645)

34 (0.8239, 0.0681)

46 36 (0.8167, 0.0599) (0.8163, 0.0678)

d2

43 (0.8093, 0.0594)

34 (0.8176, 0.0666)

46 37 (0.8157, 0.0592) (0.8036, 0.0659)

d3

48 (0.8113, 0.0640)

38 (0.8141, 0.0629)

50 41 (0.8130, 0.0645) (0.8213, 0.0572)

d4

52 (0.8083, 0.0567)

42 (0.8178, 0.0641)

54 44 (0.8182, 0.0619) (0.8150, 0.0607)

d2

43 (0.8152, 0.0647)

34 (0.8132, 0.0635)

46 37 (0.8230, 0.0571) (0.8178, 0.0602)

d3

47 (0.8180, 0.0646)

38 (0.8109, 0.0674)

50 40 (0.8061, 0.0620) (0.8136, 0.0633)

d4

51 (0.8110, 0.0630)

40 (0.8157, 0.0588)

53 43 (0.8062, 0.0631) (0.8155, 0.0659)

d1

62 (0.8080, 0.0596)

53 (0.8165, 0.0602)

67 58 (0.8093, 0.0577) (0.8146, 0.0590)

d2

63 (0.8135, 0.0593)

55 (0.8100, 0.0612)

69 61 (0.8052, 0.0609) (0.8143, 0.0579)

d3

70 (0.7973, 0.0569)

61 (0.8053, 0.0600)

75 66 (0.8091, 0.0562) (0.8072, 0.0594)

d4

77 (0.8163, 0.0543)

66 (0.8080, 0.0593)

81 71 (0.8076, 0.0589) (0.8151, 0.0598)

d2

63 (0.8125, 0.0582)

55 (0.8129, 0.0629)

69 61 (0.8094, 0.0609) (0.8072, 0.0648)

d3

69 (0.8091, 0.0580)

60 (0.8123, 0.0598)

75 66 (0.8101, 0.0570) (0.8145, 0.0588)

d4

74 (0.8055, 0.0552)

64 (0.8111, 0.0512)

79 69 (0.8088, 0.0598) (0.8052, 0.0599)

d1

43 (0.7896, 0.0572)

34 (0.8022, 0.0571)

46 36 (0.8012, 0.0514) (0.7986, 0.0573)

d2

43 (0.7908, 0.0525)

34 (0.7952, 0.0572)

46 37 (0.7994, 0.0517) (0.7863, 0.0558)

d3

48 (0.7952, 0.0570)

38 (0.7951, 0.0560)

50 41 (0.7984, 0.0584) (0.8005, 0.0507)

d4

52 (0.7941, 0.0520)

42 (0.8008, 0.0570)

54 44 (0.8047, 0.0565) (0.7990, 0.0542)

d2

43 (0.7992, 0.0575)

34 (0.7902, 0.0551)

46 37 (0.8093, 0.0495) (0.7952, 0.0502)

d3

47 (0.8039, 0.0573)

38 (0.7915, 0.0582)

50 40 (0.7926, 0.0554) (0.7957, 0.0547)

d4

51 (0.7976, 0.0567)

40 (0.7987, 0.0505)

53 43 (0.7916, 0.0583) (0.7989, 0.0586)

P3

P4

(Continued)

160

Design and Analysis of Pragmatic Trials

TABLE 5.5 (CONTINUED) Required number of clusters (empirical power, empirical type I error) in CCRTs, with and without adjustment by t-distribution Missing Pattern

J 15

IM

MM

d

P1

P2

d1

62 (0.7982, 0.0550)

53 (0.8028, 0.0538)

67 58 (0.7982, 0.0527) (0.8047, 0.0547)

d2

63 (0.8031, 0.0542)

55 (0.7971, 0.0547)

69 61 (0.7943, 0.0569) (0.8033, 0.0534)

d3

70 (0.7870, 0.0516)

61 (0.7942, 0.0551)

75 66 (0.7987, 0.0520) (0.7965, 0.0536)

d4

77 (0.8081, 0.0503)

66 (0.7971, 0.0553)

81 71 (0.7978, 0.0546) (0.8039, 0.0553)

d2

63 (0.8020, 0.0524)

55 (0.8007, 0.0574)

69 61 (0.7998, 0.0549) (0.7958, 0.0587)

d3

69 (0.7985, 0.0536)

60 (0.8009, 0.0549)

75 66 (0.8013, 0.0535) (0.8046, 0.0545)

d4

74 (0.7971, 0.0506)

64 (0.8021, 0.0469)

79 69 (0.7983, 0.0558) (0.7938, 0.0549)

P3

P4

Approximately n( bˆ - b ) follows a normal distribution with mean 0 and covariance å n = An-1 En An-1 , where n

An = n-1

åJ X ’ X i

i

i

i=1

and n

En = n

-1

Ä2

ö X¢i eˆij ÷ . ÷ j=1 ø Ji

åå i=1

Note that the variance of As n ® ¥ , we have

æ ç ç è

nzˆ is the (T + 1, T + 1) th element of å n .

An ® A = E( J i X ¢i Xi ) = E( J i )E( X ¢i Xi ) = m J éë pW ¢1W1 + (1 - p)W ¢2 W2 ùû . Here Wk = (IT , vk ) for k = 1, 2 , and P(Xi = W1 ) = p. On the other hand, ì é ï êæç En ® E = E íE ê ï êçè î ë

(

Ä2 ùü ö ï X ¢i eˆij ÷ | J i ú ý ú ÷ j=1 ø úû ïþ Ji

å

= E s 2 J i { pW¢1[W W + (J i - 1)F]W1 + (1 - p)W¢2 [W W + (J i - 1)F] F W2 }

)

= s 2 {pW¢1[E( J i )W + E( J i2 - J i )F]W1 + (1 - p)W¢2 [E( J i )W W + E( J i2 - J i )F F]W2 }.

161

The GEE Approach for Stepped-Wedge Trial Design

Using the fact that E( J i2 ) = m J2 + s J2 and defning H = m J W + ( m J2 + s J2 - m J )F F, we have E = s 2 {pW ¢1 HW1 + (1 - p)W ¢2 HW2 }. Let s T2 +1 be the (T + 1, T + 1) th element of å = A-1 EA-1 . Following a similar derivation as in Appendix A, we can obtain that

s T2 +1 =

s 2 (v1 - v2 )¢ H (v1 - v2 )

.

T

å(v

p(1 - p)m [ 2 J

1t

2 2

- v2t ) ]

t=1

Given the true treatment effect z 0 , to reject H 0 : z = 0 with power 1 - g at a two-sided signifcance level a , the required number of clusters that accounts for random variability in cluster size is calculated by n=

( z1-a /2 + z1-g )

2

(

)

s 2 (v1 - v2 )¢ é m J W + m J2 + s J2 - m J F ù (v1 - v2 ) ë û . T

å(v

z 02 p((1 - p)m J2 [

1t

- v2t )2 ]2

t=1

This sample size can be further extended to accommodate missing data:

*

n =

( z1-a /2 + z1-g )

2

(

)

s 2 (v1 - v2 )¢ é m Jdˆ ˜ W + m J2 + s J2 - m J diag(d )F Fdiag((d )ù (v1 - v2 ) ë û , T

å(v

z 02 p(1 - p)m J2 [

1t

- v2t )2 d t ]2

t=1

where d˜ and d are similarly defned as in (5.20).

Appendix A: Derivation of Equation (5.8) First we have n

An

= n-1 J

åX X ’ i

i

i=1

-1

=n J

n

i=1

As n ® ¥ , An approaches

é IT

å êë u

’ i

ui ù . ui’ui úû

162

Design and Analysis of Pragmatic Trials

S

=J

A

é IT

vs ù vs’ vs úû

åp êë v s

’ s

s=1 S

=J

åp W W . ’ s

s

s

s =1

On the other hand, we have n

=n

En

-1

n

åå i=1 n

= n-1

J

’ i ij

ö eˆ ’ij Xi ÷ ÷ j=1 ø J ö X i’eˆ ijeˆ ’ij’ X i ÷ ÷÷ j’ =1 ø J

å

å åå i=1 n

= n-1

Ä2

ö X eˆ ÷ ÷ j=1 ø J ööæ Xi’eˆ ij ÷ç ÷ç j=1 øè J

åå i=1

= n-1

æ ç ç è æ ç ç è æ ç çç è æ ç çç è

j=1

åå i=1

ö X i’eˆ ijeˆ ’ij¢ Xi ÷ . ÷÷ j=1 j’ = j+ 1 ø

J -1

J

J

åå

X i’eˆ ijeˆ ’ij X i + 2

j=1

As n ® ¥ , En approaches S

E = s 2J

åp W ( W + ( J - 1) F ) W . s

’ s

s

s=1

For the inference about treatment effect z , only s z2 is relevant, which is the ( T + 1, T + 1) th component of å = A-1 EA-1 . The last row of A-1 can be obtained as é êJ êë

-1

ù ut ( 1 - ut ) ú ëé -u’ úû t=1 T

å

1ùû .

Then we have -2

ù ut ( 1 - ut ) ú éë -u’ t=1 ûú

s =

é êJ êë

å

=

é êJ êë

ù ut ( 1 - ut ) ú éë -u’ úû t=1

2 z

T

-2

T

å S

s2 =

1ùû E éë -u’

1ùû

æ 1ùû ç s 2 J ç è

s

åp W ( W + ( J - 1) F )W ÷÷ø ëé-u’ ’ s

s=1



s

s

s=1

é Jê êë

ù ut ( 1 - ut ) ú úû t=1 T

å

2

ö

S

Fùû ( v - u ) ë W + ( J - 1) F åp ( v - u ) éW s



.

s

11ùû



163

The GEE Approach for Stepped-Wedge Trial Design

Because zˆ approximately follows a normal distribution, Equation (5.7) is equivalent to - nz 0 / s z + z1-a /2 = -z1-g . Then the required number of clusters is solved to be

( z1-a /2 + z1-g ) =

n

z

2

s z2

2 0

( z1-a /2 + z1-g ) s 2 2

=

S

ëW + ( J - 1) F ùû ( v - u ) åp ( v - u ) éW ’

s

s

s

s=1

é z 02 J ê êë

ù ut ( 1 - ut ) ú úû t=1 T

å

2

.

Appendix B: Derivation of Equations (5.9) and (5.10) 1 When p1 = ˜ = pS = , ut = S Then T

å t=1

ut ( 1 - ut ) =

T

å

S

psvst =

s=1

1 S

å

S

vst =

s=1

t -1 because T = S + 1 . T -1

T (T - 2)

å èæç T - 1 øèöæ÷ç 1 - T - 1 øö÷ = 6 (T - 1) = t -1

t -1

t=1

S2 - 1 . 6S

The required number of clusters is

( z1-a /2 + z1-g ) n

2

=

S

s2

åp ( v - u ) éëW + ( J - 1) FFùû ( v - u ) ’

s

s=1

é z Jê ëê 2 0

( z1-a /2 + z1-g )

2

s2

=

1 S

ù ut ( 1 - ut ) ú t=1 ûú T

å

S



s

s

s=1

é S2 - 1 ù z 02 J ê ú ë 6S û 2

2

ëéW + ( J - 1) F ùû ( v - u ) å ( v - u ) éW

36 ( z1-a /2 + z1-g ) Ss 2 =

s

s

S

2

å ( v - u ) éëW + ( J - 1) FFùû ( v - u ) ’

s

s

s=1

(

z 02 J S2 - 1

)

2

.

A linear mixed-effect model with cluster and patient random effects would motivate a correlation structure such that the within-subject (longitudinal) correlation W is compound symmetric (CS) with off-diagonal elements being r1 and the between-subject correlation is F = 11’ r2 . In this case, we have

164

Design and Analysis of Pragmatic Trials

S

ëW + ( J - 1) F ùû ( v - u ) å ( v - u ) éW ’

s

s

s=1

S

=

å s=1

S

=

å s=1

éæ 1 êç ( vs - u ) êç ° êç r1 ëè

˜

é æ1 ê ( vs - u ) ê( r1 + ( J - 1) r2 ) çç ° ç1 ê è ë

˜ ˛ ˜



= ( r1 + ( J - 1) r2 )

S

å s=1

= ( r1 + ( J - 1) r2 )

r2 ö ù ÷ú ° ÷ú ( vs - u ) r2 ÷ø úû 1ö æ 1 ˜ 0 öù ç ÷ú ÷ ° ÷ + ( 1 - r1 ) ç ° ˛ ° ÷ú ( vs - u ) ç 0 ˜ 1 ÷ú 1 ÷ø è øû

r1 ö æ r2 ÷ ç ° ÷ + ( J - 1) ç ° ç r2 1 ÷ø è

˜ ˛



2

é( vs - u )’ 1 ù + ( 1 - r1 ) ë û

˜ ˛ ˜

S

å(v - u) (v - u)

)

s

s=1

1 1 S S2 - 1 + ( 1 - r1 ) S2 - 1 12 6

(



s

(

)

1 2 S - 1 éë( S - 2 ) r1 + S ( J - 1) r2 + 2ùû . 12 The above derivation uses the fact that

(

=

)

S

å s=1

2

é( vs - u )’ 1 ù = ë û

S

å s=1

2

1 æ S+1 ö - s ÷ = S S2 - 1 , ç 12 è 2 ø

(

)

and S

S-1

å ( vs - u ) ( vs - u ) = 2å (S - s) æç ’

s=1

2

sö 1 2 ÷ = S -1 . 6 èSø

s=1

(

)

Then the required number of clusters is 36 ( z1-a /2 + z1-g ) Ss 2 2

n

=

ëW + ( J - 1) F ùû ( v - u ) å ( v - u ) éW ’

s

s=1

( 1 (S 12

s

) - 1) éë( S - 2 ) r + S ( J - 1) r + 2 ùû J ( S - 1)

z 02 J S2 - 1 36 ( z1-a /2 + z1-g ) Ss 2 2

=

S

z 02

2

2

1

2

2

2

3 ( z1-a /2 + z1-g ) Ss 2 éë( S - 2 ) r1 + S ( J - 1) r2 + 2ùû 2

=

(

z 02 J S2 - 1

)

.

165

The GEE Approach for Stepped-Wedge Trial Design

Appendix C: Sample Size for Cross-Sectional SW Trials For cross-sectional SW trials, the correlation structure can be modeled by W = 11’ r + ( 1 - r ) I and F = 11’ r . The required number of clusters can be expressed as

( z1-a /2 + z1-g ) n

2

S

s

2

=

åp ( v - u ) éëW + ( J - 1) FFùû ( v - u ) ’

s

s

s

s=1

é z Jê êë 2 0

( z1-a /2 + z1-g )

2

ù ut ( 1 - ut ) ú úû t=1 T

å

S

s2

=

2

åp ( v - u ) éë(1 - r ) I + J111 r ùû ( v - u ) ’

s



s

s=1

2

ù é T z J ê ut ( 1 - ut ) ú úû êë t=1 2 S ì é T ù 2 2 ï ( z1-a /2 + z1-g ) s ps í J r ê ( vst - ut )ú + (1 - r ) ïî ëê t=1 s s=1 ûú = 2 T ù é 2 z 0 J ê ut ( 1 - ut ) ú úû êë t=1 1 Under the special case of ps = for s = 1, … , S , we have S

å

2 0

s

å

å

T

å ( vst - ut )

2

t=1

å

2 ì é T ù ï ( z1-a /2 + z1-g ) s ps í J r ê ( vst - ut )ú + (1 - r ) úû ê s=1 îï ë t=1 = 2 ù é T 2 z 0 J ê ut ( 1 - ut ) ú úû êë t=1 2 1 ( z1-a /2 + z1--g ) s 2 12 S2 - 1 ëé( SJ - 2 ) r + 2ùû = 2 2 2 éS - 1ù z0 J ê ú ë 6S û S

2

n

2

å

å

T

å ( vst - ut ) t=1

2

ü ï ý þï .

ü ï ý ïþ

å

(

)

3 ( z1-a /2 + z1--g ) Ss 2 éë( SJ - 2 ) r + 2ùû 2

=

(

z 02 J S2 - 1

)

based on a similar derivation in Appendix B.

,

166

Design and Analysis of Pragmatic Trials

Appendix D: Derivation of Equation (5.15) First, we have ˆ= J A n

S

ån ( X Gˆ ) ’ s

s

Ä2

.

s

s=1

ˆ approaches As n ® ¥ , A S

A= J

åp ( X G ) ’ s

s

Ä2

s

.

s=1

On the other hand, we have Eˆ

S

=n

-1

ns

S

=n

ns

åå å s=1 i=1 S

= n-1

ns

S

=n

J

’ s sij

ö eˆ ’sij X s ÷ ÷ j=1 ø J ö X s’eˆ sijeˆ ’sij ’X s ÷ ÷÷ j’ =1 ø J

å

.

åå åå s=1 i=1

-1

Ä2

ö X eˆ ÷ ÷ j=1 ø J öæ X s’eˆ sij ÷ç ÷ç j=1 øè J

åå å s=1 i=1

-1

æ ç ç è æ ç ç è æ ç çç è æ ç çç è

ns

j=1

J

åå

åå å s=1 i=1

ö X s’eˆ sijeˆ ’sij¢ X s ÷ . ÷÷ j=1 j’ = j+ 1 ø

J -1

J

X s’eˆ sijeˆ ’sij X s + 2

j=1

As n ® ¥ , Eˆ approaches S

E=J

Fùû G X . ë W + ( J - 1) F åp X G éW s

’ s

s

s

s

s=1

We are only interested in s z2 , which is the ( T + 1, T + 1) component of å = A-1 EA-1 . The last row of A-1 can be simplifed as é êJ ëê Then we have

T

åå t=1

-1

ù wst at (1 - at )ú éë -a s=1 ûú S

1ùû .

167

The GEE Approach for Stepped-Wedge Trial Design

s = 2 z

=

é êJ ëê é êJ êë

T

-2

ù wst at (1 - at )ú éë -a s=1 ûú S

åå t=1 T

-2

ù wst at (1 - at )ú éë - a s=1 ûú S

å åå t=1

1ùû E éë -a

1ùû



S

1ùû J

åp X G éëW + ( J - 1) Fùû G X éë-a ’ s

s

s

s

s

1ùû



s=1

S

åp ( v - a ) G [W + ( J - 1)FF] G ( v - a ) s

=

s

s

s

s=1

é T Jê êë t=1

æ ç ç è

ù ö wst ÷ at ( 1 - at ) ú ÷ úû s=1 ø S

åå

s

2

.

Hence, the required number of clusters can be calculated by n

( z1-a /2 + z1-g )

=

2

s z2

z 02

( z1-a /2 + z1-g ) =

2

S

åp ( v - a )¢G [W + (J - 1)1F] G ( v - a ) s

s

T

æ ç ç è

s

s

s=1

é z 02 J ê êë

ù ö wst ÷ at ( 1 - at ) ú ÷ úû s=1 ø S

åå t=1

2

s

.

Appendix E: Proof of Theorem 1 The general sample size formula (5.20) can be rewritten as

( z1-a /2 + z1-g )

n=

2

s2

T

å(v

z 02 p(1 - p) J[

1t

- v2t )2 d t ]2

T

å(v

{

1t

- v2t )2 d t [1 + (J - 1)d tftt ] +

t=1

t=1

T -1

T

åå (v

2

1t

- v2t )(v1t¢ - v2tt¢ )[d tt¢wtt¢ + (J - 1)d td t¢ftt¢ ]}.

t =1 t ¢=t + 1

First, note that nLCRT and nCCRT are different only by the terms containing vkt ’s ( k = 1, 2 and t = 1, ˜ , T ). In fact, (v1t - v2t )2 = 1 for both LCRTs and CCRTs. Hence, the sample size formula can be further simplifed for LCRTs and CCRTs:

168

Design and Analysis of Pragmatic Trials

n=

( z1-a /2 + z1-g )

2

T

s2

åd [1 + (J - 1)d f ] +

{

T

å

z p(1 - p) J[ 2 0

dt ]

2

t

t ttt

t=1

t=1

T -1

T

åå (v

2

1t

- v2t )(v1t¢ - v2t¢ )[d tt¢wtt¢ + ( J - 1)d td t¢ftt¢ ]}.

t=1 t ¢=t+ 1

For t ¹ t¢ , we can show that (v1t - v2t )(v1t¢ - v2t¢ ) = 1 for LCRTs, and (v1t - v2t )(v1t¢ - v2t¢ ) = (-1)t + t¢ for CCRTs. With wtt¢ ³ 0 and ftt¢ ³ 0 , it is obvious that nLCRT > nCCRT for trials with T ³ 2.

Appendix F: Proof of Theorem 2 For CCRTs, it is straightforward that (v1t - v2t )(v1t¢ - v2t¢ ) = (-1)t + t¢ . Hence, Equation (5.20) can be simplifed as n=

( z1-a /2 + z1-g )

2

s2

T

åd ]

z p(1 - p) J[ 2 0

t

T

åd [1 + (J - 1)d f ]

{

2

t

t tt

t=1

t=1

T -1

T

åå (-1)

t + t¢

+2

[d tt¢wtt¢ + (J - 1)d td t¢ftt¢ ]}.

t=1 t ¢=t+ 1

Assume a CS within-subject correlation structure with wtt¢ = w ³ 0 for t ¹ t¢ . The required number of clusters is affected by missing patterns only through the term containing d tt¢ : T -1

T

åå (-1)

t + t¢

h=2

d tt¢ .

t=1 t ¢=t+ 1

We use superscript MM and IM to indicate values under the two missing patterns. For example, d tt¢IM = d td t¢ and d tt¢MM = d t¢ where t¢ > t . So we have T -1

T

åå (-1)

h MM - hIM = 2

t + t¢

d t¢ (1 - d t )

t=1 t ¢=t+ 1

é T ù é T ù = ê (-1)t (1 - d t )ú × ê (-1)t d t ú êë t =1 úû êë t=1 úû

å

å

T

åd (1 - d ) t

t=1

t

.

169

The GEE Approach for Stepped-Wedge Trial Design

In practice we usually observe a non-increasing trend in the marginal observational probabilities, that is d 1 ³ d 2 ³ ˜ ³ d T > 0 . First, we show that

å

T

(-1)t d t £ 0 :

t=1

If T is an even number ( T ³ 2 ), T

å(-1) d = (-d + d ) +˜ + (-d t

1

t

2

T -1

+ d T ) £ 0;

t=1

If T is an odd number ( T ³ 3 ), T

å(-1) d = (-d + d ) +˜ + (-d t

t

1

2

T -2

+ d T -1 ) - d T £ 0.

t=1

Using a similar argument, we can show that obvious that

å

T

å

T

(-1)t (1 - d t ) ³ 0 . It is also

t=1

d t (1 - d t ) ³ 0 . Hence, we have proved that h MM - hIM £ 0 ,

t=1

or that the MM missing pattern leads to a smaller or equal sample size requirement than the IM missing pattern for CCRTs.

References 1. M. Smith, R. Saunders, L. Stuckhardt, and J.M. McGinnis; Committee on the Learning Health Care System in America; Institute of Medicine. Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. National Academies Press, 2013. 2. R.R. Faden, N.E. Kass, S.N. Goodman, P. Pronovost, S. Tunis, and T.L. Beauchamp. An ethics framework for a learning health care system: A departure from traditional research ethics and clinical ethics. Hastings Center Report, 43(s1):S16–S27, 2013. 3. R.M. Califf, and J. Sugarman. Exploring the ethical and regulatory issues in pragmatic clinical trials. Clinical Trials, 12(5):436–441, 2015. 4. D. Schwartz, and J. Lellouch. Explanatory and pragmatic attitudes in therapeutical trials. Journal of Chronic Diseases, 20(8):637–648, 1967. 5. Z. Zhan, E.R. van den Heuvel, P.M. Doornbos, H. Burger, C.J. Verberne, T. Wiggers, et al. Strengths and weaknesses of a stepped wedge cluster randomized design: Its application in a colorectal cancer follow-up study. Journal of Clinical Epidemiology, 67(4):454–461, 2014. 6. E.B. Larson, C. Tachibana, E. Thompson, G.D. Coronado, L. DeBar, L.M. Dember, et al. Trials without tribulations: Minimizing the burden of pragmatic research on healthcare systems. In Healthcare (Vol. 4, pp. 138–141). Elsevier, 2016. 7. C.A. Brown, and R.J. Lilford. The stepped wedge trial design: A systematic review. BMC Medical Research Methodology, 6(1):54, 2006.

170

Design and Analysis of Pragmatic Trials

8. M.A. Hussey, and J.P. Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28(2):182–191, 2007. 9. A.J. Copas, J.J. Lewis, J.A. Thompson, C. Davey, G. Baio, and J.R. Hargreaves. Designing a stepped wedge trial: Three main designs, carry-over effects and randomisation approaches. Trials, 16(1):1–12, 2015. 10. E. Beard, J.J. Lewis, A. Copas, C. Davey, D. Osrin, G. Baio, et al. Stepped wedge randomised controlled trials: Systematic review of studies published between 2010 and 2014. Trials, 16(1):1–14, 2015. 11. J. Martin, M. Taljaard, A. Girling, and K. Hemming. Systematic review fnds major defciencies in sample size methodology and reporting for steppedwedge cluster randomised trials. BMJ Open, 6(2):e010166, 2016. 12. A.W. Kimball. Errors of the third kind in statistical consulting. Journal of the American Statistician Association, 57:133–142, 1957. 13. N.M. Laird, and J.H. Ware. Random effects models for longitudinal data. Biometrics, 38(4):963–974, 1982. 14. R.I. Jennrich, and M.D. Schlucter. Unbalanced repeated-measures models with structured covariance matrices. Biometrics, 42(4):805–820, 1986. 15. K. Liang, and L.L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:45–51, 1986. 16. T.J. Ypma. Historical development of the Newton–Raphson method. SIAM Review, 37(4):531–551, 1995. 17. M. Crowder. On the use of a working correlation matrix in using generalised linear models for repeated measures. Biometrika, 82(2):407–410, 1995. 18. B.W. McDonald. Estimating logistic regression parameters for bivariate binary data. Journal of the Royal Statistical Society, Series B, 55(2):391–397, 1993. 19. W.A. Fuller. Regression analysis for sample survey. Sankhya, 37(3):117–132, 1975. 20. D.A. Binder. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review/Revue Internationale de Statistique, 51(3):279–292, 1983. 21. W.W. Hauck Jr, and A. Donner. Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72(360a):851–853, 1977. 22. F. Lafontaine, and K.J. White. Obtaining any Wald statistic you want. Economics Letters, 21(1):35–40, 1986. 23. J. Wang, J. Cao, S. Zhang, and C. Ahn. Sample size determination for stepped wedge cluster randomized trials in pragmatic settings. Statistical Methods in Medical Research, 30(7):1609–1623, 2021. 24. C. Ahn, M. Heo, and S. Zhang. Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research. CRC Press, 2014. 25. S. Zhang, J. Cao, and C. Ahn. Sample size calculation for before–after experiments with partially overlapping cohorts. Contemporary Clinical Trials, 64:274–280, 2018. 26. Y. Lou, J. Cao, S. Zhang, and C. Ahn. Sample size estimation for a two-group comparison of repeated count outcomes using GEE. Communications in Statistics – Theory and Methods, 46(14):6743–6753, 2017. 27. H. Zhu, S. Zhang, and C. Ahn. Sample size considerations for split-mouth design. Statistical Methods in Medical Research, 26(6):2543–2551, 2017. 28. K. Liang, and S.L. Zeger. Longitudinal data analysis for discrete and continuous outcomes using generalized linear models. Biometrika, 84:3–32, 1986.

The GEE Approach for Stepped-Wedge Trial Design

171

29. M.S. Srivastava, T. von Rosen, and D. Von Rosen. Models with a Kronecker product covariance structure: Estimation and testing. Mathematical Methods of Statistics, 17(4):357–370, 2008. 30. R.J. Little, and D.B. Rubin. Statistical Analysis with Missing Data. John Wiley& Sons, 2014. 31. L.A. Mancl, and T.A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57(1):126–134, 2001. 32. G. Kauermann, and R.J. Carroll. A note on the effciency of sandwich covariance matrix estimation. Journal of the American Statistical Association, 96(456):1387– 1396, 2001. 33. A. Ziegler, and M. Vens. Generalized estimating equations: Notes on the choice of the working correlation matrix. Methods of Information in Medicine, 49(5):421– 425, 2010. 34. J.G. Morel, M.C. Bokossa, and N.K. Neerchal. Small sample correction for the variance of GEE estimators. Biometrical Journal, 45(4):395–409, 2003. 35. M.P. Fay, and B.I. Graubard. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics, 57(4):1198–1206, 2001. 36. W. Pan, and M.M. Wall. Small-sample adjustments in using the sandwich variance estimator in generalized estimating equations. Statistics in Medicine, 21(10):1429–1441, 2002. 37. D.F. McCaffrey, and R.M. Bell. Improved hypothesis testing for coeffcients in generalized estimating equations with small samples of clusters. Statistics in Medicine, 25(23):4081–4098, 2006. 38. C. Fan, D. Zhang, and C.H. Zhang. A comparison of bias-corrected covariance estimators for generalized estimating equations. Journal of Biopharmaceutical Statistics, 23(5):1172–1187, 2013. 39. A. Donner, and N. Klar. Design and Analysis of Cluster Randomization Trials in Health Research. Oxford University Press, 2000. 40. J.N.K. Rao, and A.J. Scott. The analysis of categorical data from complex sample surveys: Chi-squared tests for goodness of ft and independence in two-way tables. Journal of the American Statistical Association, 76(374):221–230, 1981. 41. R. Hooper, S. Teerenstra, E. de Hoop, and S. Eldridge. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35(26):4718–4728, 2016. 42. R.C.H. Cheng, and N. Amin. Estimating parameters in continuous univariate distributions with a shifted origin. Journal of the Royal Statistical Society: Series B (Methodological), 45(3):394–403, 1983. 43. N.L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, Volume 1. John Wiley & Sons, 1994. 44. J. Wang, J. Cao, S. Zhang, and C. Ahn. Sample size and power analysis for stepped wedge cluster randomised trials with binary outcomes. Statistical Theory and Related Fields, 5(2):162–169, 2021. 45. H.A. Feldman, and S.M. McKinlay. Cohort versus cross-sectional design in large feld trials: Precision, sample size, and a unifying model. Statistics in Medicine, 13(1):61–78, 1994. 46. H.B. Bosworth, M.K. Olsen, T. Dudley, M. Orr, M.K. Goldstein, S.K. Datta, et al. Patient education and provider decision support to control blood pressure in primary care: A cluster randomized trial. American Heart Journal, 157(3):450–456, 2009.

172

Design and Analysis of Pragmatic Trials

47. S.J. Arnup, J.E. McKenzie, K. Hemming, D. Pilcher, and A.B. Forbes. Understanding the cluster randomised crossover design: A graphical illustration of the components of variation and a sample size tutorial. Trials, 18(1):381, 2017. 48. C.S. Anderson, H. Arima, P. Lavados, L. Billot, M.L. Hackett, V.V. Olavarra, et al. Cluster-randomized, crossover trial of head positioning in acute stroke. New England Journal of Medicine, 376(25):2437–2447, 2017. 49. G.R. Norman, and S. Daya. The alternating-sequence design (or multipleperiod crossover) trial for evaluating treatment effcacy in infertility. Fertility and Sterility, 74(2):319–324, 2000. 50. J.-J. Parienti, and O. Kuss. Cluster-crossover design: A method for limiting clusters level effect in community-intervention studies. Contemporary Clinical Trials, 28(3):316–323, 2007. 51. K.L. Grantham, J. Kasza, S. Heritier, K. Hemming, E. Litton, and A.B. Forbes. How many times should a cluster randomized crossover trial cross over? Statistics in Medicine, 38(25):5021–5033, 2019. 52. A. Roy, D.K. Bhaumik, S. Aryal, and R.D. Gibbons. Sample size determination for hierarchical longitudinal designs with differential attrition rates. Biometrics, 63(3):699–707, 2007. 53. M. Heo, and A.C. Leon. Sample size requirements to detect an intervention by time interaction in longitudinal cluster randomized clinical trials. Statistics in Medicine, 28(6):1017–1027, 2009. 54. B. Giraudeau, P. Ravaud, and A. Donner. Sample size calculation for cluster randomized cross-over trials. Statistics in Medicine, 27(27):5578–5585, 2008. 55. A. Liu, W.J. Shih, and E. Gehan. Sample size and power determination for clustered repeated measurements. Statistics in Medicine, 21(12):1787–1801, 2002. 56. X.M. Tu, J. Kowalski, J. Zhang, K.G. Lynch, and P. Crits-Christoph. Power analyses for longitudinal trials and other clustered designs. Statistics in Medicine, 23(18):2799–2815, 2004. 57. J. Wang, J. Cao, S. Zhang, and C. Ahn. A fexible sample size solution for longitudinal and crossover cluster randomized trials with continuous outcomes. Contemporary Clinical Trials, 109:106543, 2021. 58. A. Donner, N. Birkett, and C. Buck. Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology, 114(6):906–914, 1981. 59. E. Vierron, and B. Giraudeau. Sample size calculation for multicenter randomized trial: Taking the center effect into account. Contemporary Clinical Trials, 28(4):451–458, 2007. 60. S.R. Lipsitz, and G.M. Fitzmaurice. Sample size for repeated measures studies with binary responses. Statistics in Medicine, 13(12):1233–1239, 1994. 61. A.K. Manatunga, M.G. Hudgens, and S. Chen. Sample size estimation in cluster randomized studies with varying cluster size. Biometrical Journal, 43(1):75–86, 2001.

6 The Mixed-Effect Model Approach and Adaptive Strategies for Stepped-Wedge Trial Design

6.1 A Brief Review of Mixed-Effect Models The generalized estimating equation (GEE) approach specifes populationaveraged models that require a marginal mean model as well as a correlation model that directly accounts for the association structure among observations. The mixed-effect model approach, on the other hand, requires the specifcation of a conditional mean model with an association structure induced by random effects [1]. Mixed-effects models are called “mixed” because they simultaneously include fxed and random effects, the former representing average effects that are expected to persist (across experimental conditions and time, etc.), and the latter modeling the extent to which these trends vary across levels of some grouping factor (e.g., participants or clusters). In a closed-cohort SW trial, let Yijt be the outcome measured from the jth ( j = 1, ˜ , J ) subject of the ith (I = 1, ˜ , n) cluster at Step t (t = 1, ˜ , T ). Defne Yij = (Yij1 , ˜ , YijT )¢ and Yi = (Yi1’,˜, YiJ’)¢ . The mixed-effect model approach assumes Yijt to follow a parametric distribution, for example, a normal distribution for a continuous outcome, a Poisson model for a count outcome. Let E(Yijt ) = mijt , and defne mi to be a vector of mijt in a similar fashion as Yi . Then a general form of the mixed-effect model is g( mi ) = Xi b + Zia i , where g(×) is the link function specifed according to the distribution of Yijt , b is the vector of fxed effects that model averaged trends (such as the baseline temporal trend and intervention effect), Xi is the design matrix for fxed effects, and a i and Zi are the vector of random effects and its design matrix, respectively. a i can include multiple sources of random variation. For example, it can include cluster random effects, individual random effects, as well as random effects for cluster-step interactions. The random effects

DOI: 10.1201/9781003126010-6

173

174

Design and Analysis of Pragmatic Trials

can be assumed to be independent, which leads to an exchangeable correlation structure among cluster members, or as a vector can be assumed to have a certain dependence structure (such as exponential decaying) [2]. More examples will be presented later. Unlike the GEE method, the correlation structures implied by random effects are assumed on the link function, not on the observations themselves. In summary, a likelihood function is specifed for the outcome, which belongs to the exponential family. The model parameters are often estimated by the maximum likelihood method or by a variant known as restricted maximum likelihood (REML) [3, 4]. The variance of the estimators can be obtained from approximate information matrices. Within the context of SW trials, when the cluster size is constant at each step, researchers have utilized the properties of cluster-step means to simplify the derivation of sample size. The impact of varying cluster sizes on sample size requirement has been investigated by [5, 6]. In this chapter, we also review correlation structures that have been frequently employed in mixed-effect models specifed for SW trial design.

6.2 Sample Size Calculation Based on Cluster-Step Means For SW closed-cohort trials with a continuous outcome, Hooper et al. [7] considered the following model: Yijt = lt + uitz + t i + q ij + hit + eijt ,

(6.1)

where Yijt is the continuous response measured at Step t (t = 1, ˜ , T ) from Subject j ( j = 1, ˜ , J ) of Cluster i (i = 1, ˜ , n). On the right-hand side, l t denote step-specifc intercepts; uit indicates the treatment received by Cluster i at Step t, with 0/1 denoting control/intervention; z is the intervention effect; ti is the random effect of cluster; q ij is the random effect of Subject j from Cluster i; hit is the random effect of cluster-step interaction; and eijt is the residual error term. Hence, ui = (ui1 , ˜ , uiT )¢ represents the sequence of treatments assigned to Clusters i. Randomization is performed so that P(ui = vl ) = 1/ L , with {vl , l = 1, ˜ , L} being the set of unique sequences employed in the trial. Under the classical scheme of SW trials, such as that depicted in Figure 5.1 (Chapter 5), we have L = T - 1. In practice, there might be cases where researchers want to skip treatment crossover at certain steps, due to, for example, logistic constraints or the need to review interim results. In such cases, we have L < T - 1. We defne the matrix V = {vlt } , which has a dimension of L ´ T . The random effects are assumed to be normally distributed,

Mixed-Effect Model Approach and Adaptive Strategies

175

t i  N(0, s t2 ), q ij  N(0, s q2 ), hit  N(0, s h2 ), eijt  N(0, s e2 ). Hence, the marginal variance of Yijt is s 2 = s t2 + s q2 + s h2 + s e2 . Defne Yit =

å

J

Yijt / J to be the cluster-step mean of all observations from Cluster

j=1

i at Step t. The basic idea of Hooper et al. [7] is that an unbiased estimator of treatment effect z can be constructed based on a linear combination of cluster-step sample means, {Yit , i = 1, ˜ , n and t = 1, ˜ , T}. To facilitate discussion, frst, we defne

r = (s h2 + s t2 )/ s 2 , p = s t2 /(s h2 + s t2 ), t = s q2 /(s e2 + s q2 ). Here r is the intracluster correlation (correlation between two subjects from the same cluster at the same step), and p is the correlation between two population means (not sample means) from the same cluster but at different steps, called cluster autocorrelation; and t is, given a cluster, the correlation between observations of the same subject across steps, called the individual autocorrelation [8]. The sample size for an SW trial can be obtained through the following steps: First, we consider a traditional 1:1 individually randomized (IR) trial that only measures the outcome once. Hence, cluster size J = 1 and there is only one step (T = 1). The number of individuals required to detect treatment effect z 0 with power 1 - g at two-sided type I error a is calculated by nIR =

4s 2 ( z1-a /2 + z1-g )2 . z 02

(6.2)

Then we consider a traditional 1:1 cluster randomized (CR) trial which only measures the outcome once, with cluster size J and T = 1. A convenient approach is to calculate sample size using (6.2) by treating the cluster means {Yi , i = 1, ˜ , n} as independent (individual) observations, with variance Var(Yi ) =Deff 1( J , r )× s 2 / J. Here Deff 1( J , r ) = 1 + (J - 1)r

176

Design and Analysis of Pragmatic Trials

has been called the design factor that accounts for the impact of clustered randomization. Deff 1 depends on two parameters: cluster size J and intracluster correlation r . Then the required number of clusters is calculated by nCR =Deff 1( J , r ) ×

nIR . J

Note that the treatment effect estimator is a linear combination of Yi ’s. Finally, we consider a closed-cohort SW trial. When there is no missing data and the cluster size (J) is constant, it is obvious that the treatment effect can be estimated by a linear combination of cluster-step means Yit ’s. Besides clustering, the variance of the estimator is further affected by the transitioning scheme (represented by V ) and longitudinal correlation r = Corr(Yit , Yit¢ ) for t ¹ t¢ . It can be shown that r=

J rp + (1 - r )t . 1 + (J - 1)r

Hooper et al. [7] derived the design factor that accounts for the stepwise transitioning aspect of SW trials: Deff 2(V , r) =

L2 (1 - r)[1 + (T - 1)r] 4{LB - D + [B2 + L(T - 1)B - (T - 1)D - LC]r}

with terms B, C, and D defned based on matrix V : L

B=

T

L

æ ç ç è

T

ååv , C = å å lt

l=1 t=1

l=1

t

2

ö vlt ÷ , D = ÷ ø

T

æ ç ç è

åå t=1

2

ö vlt ÷ . ÷ l=1 ø L

Then the total number of required clusters for a closed-cohort SW trial is nSW =Deff 2(V , r ) ×Deff 1( J , r )×

nIR . J

(6.3)

Under the classic SW transitioning scheme where æ0 ç ç0 V = ç0 ç ç° ç0 è

1 0 0 ° 0

1 1 0 ° 0

˜ ˜ ˜ ˛ ˜

1 1 1 ° 0

1ö ÷ 1÷ 1÷ , ÷ °÷ 1 ÷ø

we have L = T - 1 and Deff 2 can be simplifed to Deff 2(V , r) =

3L(1 - r)(1 + Lr) . (L2 - 1)(2 + Lr)

(6.4)

Mixed-Effect Model Approach and Adaptive Strategies

177

Through different defnitions of V , this approach is fexible to account for many popular trial schemes. For example, æ0ö V =ç ÷ è 1ø corresponds to parallel cluster randomized trials with one measurement of outcome; æ0 V =ç è1

0ö ÷ 1ø

˜ ˜

corresponds to parallel cluster randomized trials with longitudinal outcome measurements; and æ0 V =ç è1

1ö ÷ 0ø

corresponds to crossover cluster randomized trials. Hooper and Bourke [9] have presented design factor Deff 2 for various trial schemes. For example, Deff 2(V , r) = (1 - r)/ 2 for crossover cluster randomized trials. It is noteworthy that the design factor Deff 1( J , r ) of clustering and the matrix V of transitioning scheme are generally applicable to both cross-sectional and closed-cohort SW trials. From the experimental design’s point of view, the difference between a cross-sectional SW trial and a closed-cohort SW trial lies in whether the same patients are measured multiple times across steps, which affects r, the autocorrelation between cluster-step sample means. Researchers have assumed different models for data arising from cross-sectional SW trials. For example, Hooper et al. [7] assumed the following model, Yijt = l t + uitz + ti + hit + eijt ,

(6.5)

which is different from (6.1), the model assumed for closed-cohort SW trials, by removing individual random effects q ij . Under this assumption, the marginal variance of Yijt is s 2 = s t2 + s h2 + s e2 and the autocorrelation between cluster-step sample means is r = Corr(Yit , Yit¢ ) =

J rp ,for t ¹ t¢. 1 + (J - 1)r

Further simplifed models, where the step random effects hit are removed from (6.5), have been employed by researchers when developing sample size formulas for cross-sectional SW trials [10–12]. This assumption effectively sets p = 1 and r = J r /[1 + (J - 1)r ]. Under this framework, however, as long as the values of V and r are determined, we can calculate the design factor Deff 2(V , r) by Equation (6.4), and

178

Design and Analysis of Pragmatic Trials

in turn the required number of clusters by Equation (6.3), no matter if it is a cross-sectional or a closed-cohort SW trial. This property has also been noted by Teerenstra et al. [8] in the context of analysis of covariance for cluster randomized trials. In sample size calculation for SW trials with binary outcomes, based on normality approximation, linear mixed-effect models similar to (6.1) have also been employed. For example, in the context of cross-sectional SW trials, Hussey and Hughes [10] and Hemming et al. [13] assumed the following model for the binary response observed from the jth subject in the ith cluster at Step t, Yijt = lt + uitz + t i + eijt ,

(6.6)

which drops individual random effects q ij and step-within-cluster random effects hit from Model (6.1). The other parameters are defned similarly as in (6.1). Let P0 and P1 denote the response rate under control and intervention, respectively. Then, the parameter z represents the treatment effect z = P1 - P0 and the null hypothesis of interest is H 0 : z = 0. The marginal variance of Yijt is s 2 = s t2 + s e2 and ICC is r = s t2 / s 2 . Recall that s t2 and s e2 are the variances of cluster random effect t i and residual error eijt , respectively. As the variance of a binary outcome, different specifcations of s 2 have been employed by researchers. For example, based on the rationale that p-value is defned as the probability of a test statistics being greater or equal to the observed value assuming the null hypothesis to be true, researchers have assumed s 2 to equal to the null variance,

s 2 = P0 (1 - P0 ). This specifcation was adopted by Hussey and Hughes [10]. Alternatively, Baio et al. [14] used pooled variance,

s2 =

P0 + P1 æ P0 + P1 ö ç1÷, 2 è 2 ø

which is based on the average response rate. Finally, Hemming et al. [13] directly used the averaged variance,

s2 =

P0 (1 - P0 ) + P1(1 - P1 ) . 2

The above specifcations have also been assumed for the residual variance s e2 instead of the marginal variance. Regardless of the particular specifcation, the key takeaway is that under the above setup, once s 2 (or s e2 ) and ICC are specifed, the autocorrelation between cluster-step means, r = Corr(Yit , Yit¢ ) , can be obtained. Then, design

Mixed-Effect Model Approach and Adaptive Strategies

179

factors Deff 2(V , r) and Deff 1( J , r ) are similarly applicable and Formula (6.3) can be used to assess sample size requirement for SW trials with binary outcomes. In summary, sample size calculation based on the cluster-step means approach provides a closed-form sample size formula and intuitively decomposes the impact of SW design on sample size requirement into two design factors: Deff 1( J , r ) accounting for clustering and Deff 2(V , r) for stepwise transitioning of treatments over the steps. Through different specifcations of r, the autocorrelation between cluster-step means, this sample size approach accommodates both closed-cohort and cross-sectional SW trials. Utilizing normality approximation, it is also applicable to SW trials with binary outcomes. On the other hand, this approach has a few limitations. First, it assumes equal cluster sizes and balanced randomization among the L treatment sequences (described by V ). In practice, clusters are often formed naturally (for example, by physicians or by neighborhoods) and have variable sizes; second, its derivation requires the normality assumption. Deviation from normality, however, has been frequently encountered in biomedical research; third, it assumes longitudinal correlations to be constant regardless of the temporal distance between two measurements. More sophisticated correlation structures, such as the AR(1) structure, were not considered. Finally, by assuming equal cluster sizes in closed-cohort SW trials, it effectively assumes complete observations from every subject across all steps. This assumption can be unrealistic for most pragmatic trials, which are conducted in routine clinical settings. 6.2.1 Commonly Used Correlation Structures Correlation structure is an important factor in the design and analysis of SW trials. When the mixed-effect model approach is employed, dependence due to clustered and longitudinal measurements is modeled through the specifcation of random effects. Here, we review correlation structures that have been frequently adopted in mixed-effect models. 6.2.1.1 Exchangeable Correlation Structure Using the same notations as Model (6.1), the model specifed by Hussey and Hughes [10] can be presented as follows, Yijt = l t + uitz + ti + eijt .

(6.7)

Recall that Yijt is a continuous outcome measured from the jth ( j = 1, ˜ , J ) subject of the ith (I = 1, ˜ , n) cluster at Step t (t = 1, ˜ , T ) in an SW trial. l t is the step-specifc intercept, uit = 0 /1 indicates the treatment received by Cluster i at Step t, z represents the intervention effect, ti  N(0, s t2 ) is the

180

Design and Analysis of Pragmatic Trials

cluster random effect, and eijt  N(0, s e2 ) is residue error independent of ti . Model (6.7) has been frequently used in power analysis for cross-sectional SW trials. It assumes a within-cluster exchangeable correlation structure, with Corr(Yijt , Yij¢t¢ ) = s t2 /(s t2 + s e2 ) = r . Hussey and Hughes [10] derived the variance of the intervention effect estimator, Var(zˆ ) =

ns 2 (s 2 + Ts t2 ) , (nU - W)s + (U 2 + nTU - TW - nV )s t2

(6.8)

2

å å

å å

2

n T T n æ s 2 = s t2 + s e2 , U= uit , W= uit ö÷ , and ç i=1 t=1 t=1 è i=1 ø 2 n T æ V= uit ö÷ . The derivation of (6.8) forms the basis of subsequent ç i=1 è t=1 ø power analysis. Li et al. [2] re-wrote (6.8) as

where

å å

ˆ = Var(z)

(s 2 / J )nT k1k 2 , (U + nTU - TW - nV)k 2 - (U 2 - nV)k1 2

(6.9)

where k1 = 1 - r and k 2 = 1 + (TJ - 1)r are two distinct eigenvalues of the within-cluster exchangeable correlation matrix. They further showed that (6.9) represents a general form that applies to several other model variants. 6.2.1.2 Nested Exchangeable Correlation Structure Hooper et al. [7] and Girling and Hemming [15] employed a nested exchangeable correlation structure through model Yijt = l t + uitz + ti + hit + eijt .

(6.10)

It includes an additional random effect for cluster-step interaction, hit  N(0, s h2 ). Two types of within-cluster correlation arise from Model (6.10),

rw = (s t2 + s h2 )/(s t2 + s h2 + s e2 ), rb = s t2 /(s t2 + s h2 + s e2 ), where rw and rb correspond to correlation among observations within the same period and between periods, respectively. Defne Yit = (Yi1t , ˜ , YiJt )¢ to be the cluster-step specifc vector of observations and Yi = (Yit’,˜, YiT’)¢ to be

181

Mixed-Effect Model Approach and Adaptive Strategies

the collection of observations from Cluster i. Given T = 3 steps and cluster size J = 2, we have æ 1 ç ç rw ç rb Corr(Yi ) = ç ç rb ç rb ç ç rb è

rw 1 rb rb rb rb

rb rb 1 rw rb rb

rb rb rw 1 rb rb

rb rb rb rb 1 rw

rb ö ÷ rb ÷ rb ÷ ÷ rb ÷ rw ÷ ÷ 1 ÷ø

Li et al. [2] showed that the variance of the estimated treatment effect also takes the form of (6.8) with s 2 = s t2 + s h2 + s e2 , k1 = 1 + ( J - 1)rw - J rb , and k 2 = 1 + (J - 1)rw + J (T - 1)rb , where k1 and k 2 are two distinct eigenvalues of the nested exchangeable correlation matrix. Defne Yit =

å

J

Yijt / J to be the cluster-step mean. It can be shown that

j=1

Corr(Yit , Yit¢ ) =

r J rb ® b as J ® +¥, for t ¹ t¢. rw 1 + ( J - 1) rw

The ratio rb / rw has been called the cluster autocorrelation (CAC), which is interpreted as the limit of correlation between two cluster-step population means [7, 2]. An alternative type of nested exchangeable correlation structure was considered by Baio et al. [14], based on model Yijt = lt + uitz + t i + q ij + eijt ,

(6.11)

where q ij  N(0, s q2 ) are random effects for individuals in a closed-cohort SW trial. Model (6.11) leads to two correlation coeffcients:

rw* = (s t2 + s q2 )/(s t2 + s q2 + s e2 ), rb* = s t2 /(s t2 + s q2 + s e2 ), where rw* is the correlation among measurements from the same individual, and rb* is the correlation among measurements from different individuals within a cluster. Note that rw* and rb* are exchangeable with respect to steps (time), while rw and rb are exchangeable with respect to individuals. If we rewrite Yi as Yi* = (Yi1*¢ , ˜ , YiJ*¢ )¢ with Yij* = (Yij1 , ˜ , YijT )¢ being individual-specifc vectors, then the nested exchangeable structure is more obvious:

182

Design and Analysis of Pragmatic Trials

æ 1 ç * ç rw ç rw* Corr(Yi* ) = ç * ç rb ç r* ç b ç rb* è

rw* 1 rw* rb* rb* rb*

rw* rw* 1 rb* rb* rb*

rb* rb* rb* 1 rw* rw*

rb* rb* rb* rw* 1 rw*

rb* ö ÷ rb* ÷ rb* ÷ ÷. rw* ÷ rw* ÷ ÷ 1 ÷ø

Recall that here we assume J = 2 and T = 3. Li et al. [2] pointed out that Var(hˆ ) shares the form of Equation (6.8) with s 2 = s t2 + s q2 + s e2 , k1 = 1 - rw* , and k 2 = 1 + (T - 1)rw* + T ( J - 1)rb* . Here k1 and k 2 are eigenvalues of the within-cluster nested exchangeable correlation matrix. 6.2.1.3 Block Exchangeable Correlation Structure Hooper et al. [7] and Girling and Hemming [15] considered the following model: Yijt = lt + uitz + t i + q ij + hit + eijt . Note that this model is the same as (6.1), which we discussed in the context of sample size calculation based on cluster-step means in Section 6.2. Here we focus on the resulting block exchangeable correlation structure. It involves three correlation coeffcients:

ra = (s t2 + s q2 )/(s t2 + s q2 + s h2 + s e2 ), rw = (s t2 + s h2 )/(s t2 + s q2 + s h2 + s e2 ), rb = s t2 /(s t2 + s q2 + s h2 + s e2 ), where ra is the correlation among observations from the same subject, rw is the correlation among observations measured at the same step from a cluster, and rb is the correlation among observations measured at different steps from different subjects of a cluster. Given cluster size J = 2 and T = 3 steps, the correlation matrix is æ 1 ç ç rw ç ra Corr(Yi ) = ç ç rb ç ra ç ç rb è

rw 1 rb ra rb ra

ra rb 1 rw ra rb

rb ra rw 1 rb ra

ra rb ra rb 1 rw

rb ö ÷ ra ÷ rb ÷ ÷. ra ÷ rw ÷ ÷ 1 ÷ø

183

Mixed-Effect Model Approach and Adaptive Strategies

It has been called the block exchangeable correlation structure because the correlation matrices of cluster-step vectors (Yit ’s) are exchangeable across steps. Importantly, the variance of estimated treatment effect is of the same form as (6.8), with s 2 = s t2 + s q2 + s h2 + s e2 , k1 = 1 + (J - 1)( rw - rb ) - ra , and k 2 = 1 + (J - 1)rw + (T - 1)(J - 1)rb + (T - 1)ra . Here k1 and k 2 are eigenvalues of the block exchangeable correlation matrix. Finally, Li et al. [16] investigated the impact of different correlations on trial effciency: larger values of rw are associated reduced effciency, while larger values of rb and ra are associated with greater effciency. 6.2.1.4 Exponential Decay Correlation Structure Kasha et al. [17] employed an exponential decay correlation structure, which allows correlation to decay exponentially over time, through model Yijt = lt + uitz + hit + eijt .

(6.12)

The vector of random effects hi = (hi1 , ˜ ,hiT )¢ is assumed to follow a multivariate normal distribution, N(0, s h2M), with matrix M taking the form æ 1 ç r M =ç ç ° ç T -1 çr è

r

r2

˜

1 °

r °

rT -2

r T -3

˜ ˛ ˜

r T -1 ö ÷ rT -2 ÷ , r < 1. ° ÷ ÷ 1 ÷ø

(6.13)

Hence, the correlation among observations from the same step is Corr(Yijt , Yij¢t ) = rw = s h2 /(s h2 + s e2 ), while across steps (t ¹ t¢ ), |t - t ¢| Corr(Yijt , Tij¢t¢ ) = rw r . Note that Corr(Yijt , Tij¢t¢ ) applies both to within-subject ( j = j¢ ) and to between-subject ( j ¹ j¢ ) observations. Given T = 3 steps and cluster size J = 2, an exponential decay correlation structure has the form æ 1 ç ç rw ç rw r Corr(Yi ) = ç ç rw r ç r r2 ç w ç rw r 2 è

rw 1 rw r rw r rw r 2 rw r 2

rw r rw r 1 rw rw r rw r

rw r rw r rw 1 rw r rw r

rw r 2 rw r 2 rw r rw r 1 rw

rw r 2 ö ÷ rw r 2 ÷ rw r ÷ ÷. rw r ÷ rw ÷ ÷ 1 ÷ø

(6.14)

The variance of estimated treatment effect does not have a closed form under exponentially decaying correlation. However, Var(zˆ ) can be calculated numerically using a general formula presented in [17].

184

Design and Analysis of Pragmatic Trials

Researchers have also referred to (6.14) as the discrete-time exponential decay correlation because it effectively assumes measurements to be obtained at the same time within each step. This assumption can be oversimplifcation of reality because, more often than not, patients are continuously enrolled and measured. Grantham et al. [18] proposed an extension that allows correlation decay to depend on the actual distances between measurement times. Design and analysis methods that account for continuous-time exponential decay correlation, however, have been lacking in literature. 6.2.1.5 Proportional Decay Correlation Structure Li et al. [2] presented a mixed-effect model, Yijt = l t + uitz + hit + eijt , where besides assuming hi = (hi1 ,˜ ,hiT )¢ ~ N(0, s h2M), it assumes a similar autoregressive structure on the residual errors. That is, it assumes eij = (eij1 ,˜ , eijT )¢ ~ N(0, s e2 M). Recall that matrix M is defned in (6.13). The above specifcation leads to a proportional decay correlation structure, which was frst used in the analysis of multi-level longitudinal data [19, 20], and has been applied to the design and analysis of closed-cohort SW trials [21]. It assumes the within-subject correlations to decay exponentially with the temporal distance between measurements, Corr(Yijt , Yijt¢ ) = r|t -t¢|. Furthermore, it assumes the between-subject correlations to follow the same rate of decay, Corr(Yijt , Yij¢t¢ ) = rw r|t -t¢| = rw Corr(Yijt , Yijt¢ ), j ¹ j¢, with rw = s h2 /(s h2 + s e2 ). In other words, there is a proportional relationship between within-subject and between-subject correlations. An example matrix given J = 2 and T = 3 is æ 1 r rw rw r r2 rw r 2 ö ç ÷ 1 rw r r rw r 2 r2 ÷ ç rw ç r 1 rw r rw r rw r ÷ ÷. Corr(Yi ) = ç (6.15) 1 r rw rw r r ÷ ç rw r ç r2 rw r 2 r rw r 1 rw ÷ ç ÷ 2 2 ç rw r r rw r r rw 1 ÷ø è Lefkopoulou et al. [20] showed that the proportional decay correlation matrix can be written as a Kronecker product between a CS and an AR(1) matrix. This property facilitates analytical derivation of the variance of estimated treatment effect: Var(zˆ ) =

s 2n(1 - r 2 )[1 + (J - 1)rw ]/ J , (nU - W)(1 + r 2 ) - 2(nD - E)r

185

Mixed-Effect Model Approach and Adaptive Strategies

å å E = å æç å u öæ ÷ç å è øè

where U =

n

T

i=1

t=1

T -1

n

t=1

i=1

it

uit , W = n

å å æ ç t=1 è

T

2

uit ö÷ , D = i=1 ø n

å å n

T -1

i=1

t=1

uitui ,t +1 , and

ui ,t +1 ö÷ . ø

i=1

6.3 Adaptive Strategies for SW Trials Adaptive strategies enable researchers to modify design in the middle of a clinical trial based on interim observations, without destroying the validity and integrity of intended studies [22]. The process of stepwise crossover to intervention in SW trials presents a great potential where various (Bayesian and frequentist) adaptive design methods can be implemented. Generally, fexibility can be increased, while patient exposure, cost, and trial duration can be reduced. Here we investigate the incorporation of existing adaptive approaches such as frequentist group sequential design and Bayesian predictive probability approach into SW trials. 6.3.1 Group Sequential Design for SW Trials The group sequential design enables researchers to perform analysis on accumulating data in mid-trial, based on which the trial can be stopped early in case of overwhelming effcacy or futility [23–25]. To implement group sequential design in SW trials, in addition to standard design parameters, the interim stages (the steps at which to perform interim analysis) and the corresponding stopping boundaries need to be specifed. At each interim stage, accumulated data up to that point are analyzed and the test statistics is compared with stopping boundaries to determine whether the trial should stop or continue. This design enables trials to stop early if interim data show overwhelming evidence of effcacy or futility. Performing multiple interim tests results in an infated type I error. To address this problem, different approaches have been proposed to specify stopping boundaries so the overall type I error is appropriately controlled. Without loss of generality, we describe the adaptive strategies in the context of a typical cross-sectional SW trial, with T steps and n clusters. All clusters start from control, and they are randomized to switch to intervention at Step t (t = 2, ˜ , T ) and remain on intervention until the end of study, resulting in S = (T - 1) possible sequences. Let ps be the probability of a cluster being assigned to sequence vs = (vs1 , ˜ , vsT )¢ with

å

S

ps = 1, where vs is defned so

s=1

that its elements vst = 0 for t = 1, ˜ , s and vst = 1 for t = s + 1, ˜ , T . Here value 0/1 indicates control/intervention. In a cross-sectional SW-CRT, at each step, a new panel of J subjects per cluster are enrolled. We assume that each subject

186

Design and Analysis of Pragmatic Trials

contributes one outcome measurement. Let Yijt be the measurement of a continuous outcome obtained at Step t ( t = 1, … , T ) from Subject j( j = 1,…, J ) of the ith cluster (i = 1,… , n). Mixed-effect models have been assumed for Yijt . For example, Hooper et al. [7] considered Yijt = l t + uitz + ti + q ij + hit + eijt ,

(6.16)

where lt denote step-specifc intercepts; uit indicates the treatment received by Cluster i at Step t, with 0/1 denoting control/intervention; z is the intervention effect; t i is the random effect of cluster; q ij is the random effect of Subject j from Cluster i; hit is the random effect of step within cluster; and eijt is the residual error term. Defne ui = (ui1 , ˜ , uiT )¢ . By randomization, P(ui = vs ) = ps . The hypothesis of interest is H 0 : z = 0. We defne the set of interim stages H = {h1 ,˜hK } such that interim analysis will be performed at Step hk , k = 1, ˜ , K . We assume that h1 > 1 and hK = T. At interim Stage hk , based on accumulated data up to that point, the estimated treatment effect is denoted by zˆ ( k ). We also defne the Wald test statistics Z( k ) = zˆ ( k ) / s z( k ) , where s z2(k) is the estimated variance of zˆ ( k ). Defne z 0 to be the true value of intervention effect. Under linear mixed-effect models, Z( k ) has the following properties [26]: E(Z( k ) ) = z 0 / s z( k ) , Var( Z( k ) ) = 1, and Cov(Z( k ) , Z( k ¢) ) = s z( k ¢) / s z( k ) , for k, k¢ Î H and k £ k¢. Defne Z = (Z(1) , ˜ , Z( K ) )¢. It is obvious that Z follows a multivariate normal distribution and its density function, denoted by f( Z) , can be obtained based on the above properties. Under group sequential design, futility boundaries { f1 ,˜, f K } and effcacy boundaries {e1 , ˜, eK } are specifed, and the trial proceeds as follows: For k = 1, ˜ , K , • if k < K and • Z( k ) £ f k , stop the trial early for futility and accept H 0 ; • Z( k ) > ek , stop the trial early for effcacy and reject H 0 ; • f k < Z( k ) £ ek , continue to interim Stage hk +1 . • if k = K and • Z( k ) £ f K , accept H 0 ; • Z( k ) > eK , reject H 0 ; • f K < Z( k ) £ eK , inconclusive. One common practice has been to set f K = eK , so that a conclusion is always reached at the end of study. The probability of any particular trial result can be calculated. For example, the overall type I error is

187

Mixed-Effect Model Approach and Adaptive Strategies

(

)

1 - P ÇKk=1 {Z( k ) £ ek }|H 0 = 1 -

e1

ò ò -¥

˜

eK



f0 (Z)dZ(1) ˜dZ( K ) ,

where f0 ( Z) denotes the density function of Z under the null hypothesis H 0 . The probability of early stop at Stage h2 for futility can be calculated by

(

)

P f1 < Z(1) £ e1 and Z(2) £ f 2 =

e1

f2

òò ò f1



-¥ -¥

˜

ò





f ( Z)dZ(1) ˜dZ( K ) .

The calculation of such probabilities allows researchers to evaluate the operating characteristics of SW trials with group sequential design, including power, type I error, probability of early stop, and expected number of measurements. Note that by setting e1 = ˜ = eK -1 = +¥ or f1 = ˜ = f K -1 = -¥ , researchers can dictate that the SW trial does not stop early for effcacy or futility, respectively. To maintain the overall power and type I error at desired levels, different approaches have been proposed to set the stopping boundaries {(ek , f k )}. Pocock [27] proposed to use a constant value C( P ). O’Brien and Fleming [28] proposed to set the boundary at interim Stage hk at C (OBF) / k . Wang and Tsiatis [24] generalized these ideas to the “D class”, C (WT )k D-0.5 , where D can take any value. In each case, the constant C is specifed separately for ek ’s and f k ’s to achieve the desired levels of overall power and type I error. Lan and DeMets [29] introduced the a -spending function, which is used to determine the amount of type I and type II error “spent” at a particular interim stage. First, we defne g k = s z2(K ) / s z2( k ) to be the information time fraction (ITF) at interim Stage hk . We use 1 - g and a to denote the desired levels of overall power and type I error. Then spending functions a * and g * can be any isotonic function on [0,1] with the constraints that a * (0) = g * (0) = 0, a * (1) = a , and g * (1) = g . We further defne qk1 = P( stop at Stage hk for efficacy|H 0 ), qk 2 = P( stop at Stage hk for futility|H1 ), which are the probabilities of committing type I and type II errors, respectively, at interim Stage hk (k = 1, ˜ , K ). Then the values of ek ’s and f k ’s can be obtained by successively solving equations q11 = a * (g1 ), q12 = g * ( g1 ), and qk1 = a * ( g k ) - a * (g k -1 ), qk 2 = g * ( g k ) - g * (g k -1 ),

188

Design and Analysis of Pragmatic Trials

for 2 £ k £ K - 1. For convenience, Grayling et al. [30] have forced eK = f K so that a decision is made at the fnal analysis. To prioritize the maintenance of type I error, they solved for eK from the equation: qK1 = a * (g K ) - a * (g K -1 ) = a - a * (g K -1 ). To explain the basic idea, in the above we have assumed the variance s z2 to be known. In practice, pre-trial estimates of s z2 can be obtained using data from prior studies. When a reliable estimate is unavailable, however, researchers have proposed to re-estimate s z2 at interim stages, and the stopping boundaries identifed based on pre-trial assumed variance need to be adjusted accordingly. For example, a quantile substitution adjustment procedure can be employed [31]. More details can be found in a study by Grayling et al. [30]. 6.3.1.1 Bayesian Adaptive Design for SW Trials For Bayesian adaptive design, we present an approach based on predictive probabilities (PP) for SW trials. This method [32] was frst developed in the context of phase II clinical trials, which examines accumulating data at interim steps to determine whether the trial should be stopped early due to overwhelming effcacy/futility or should continue. By constructing decision rules based on PP, this adaptive approach mimics typical clinical decisionmaking processes. Specifcally, conditioning on interim data, the chance (predictive probability) that the trial reaches a conclusion of effectiveness at the planned end of the study is evaluated. The decision to continue or to stop is made according to the magnitude of this predictive probability, examined against pre-specifed thresholds. We demonstrate that this PP approach can be incorporated into SW trial design. In addition to the fexibility of early stopping, the Bayesian PP framework allows information from previous studies or literatures to be formally incorporated into design and analysis through the specifcation of priors [33, 34]. Furthermore, it can be easily extended to a decision analysis framework where patient/stakeholder utilities are incorporated into the SW trial design [35]. Without loss of generality, we use Model (6.16) as an example to describe the application of Bayesian PP approach to SW trials. At each of the interim stages hk (k = 1, ˜ , K ), the predictive probability of declaring the intervention effective (rejecting H 0 : z = 0) at the end of the trial given current observations will be calculated to inform decisions. The key idea lies in the calculation of these predictive probabilities. To facilitate discussion, we introduce Y to denote the likelihood model where Y repa few notations. We use [Y|Y] resents the collection of observed data and Y the set of unknown parameters. For example, under the linear mixed-effect model (6.1), Y denotes the set of continuous outcome measurements Yijt ’s, [Y|Y] Y is a multivariate

Mixed-Effect Model Approach and Adaptive Strategies

189

normal model, and Y includes treatment effect z , step-specifc intercepts l = (l1 , ˜ , lT )¢ , random effects t i ’s, q ij ’s, and hit ’s, residual variance s 2 , and all the other hyper-parameters (such as variance parameters for the random effects). Under the Bayesian paradigm, unknown parameters are treated as random variables and assigned prior distributions. We use [Y ] to denote the prior model. For example, it has been a common practice to assign the normal prior and inverse-gamma prior to regression coeffcients (z and l ) and variance parameter s 2 , respectively. In the following, we present the Bayesian PP approach for SW trials using the generic framework of [Y|Y] Y and [Y]. Through different specifcations of [Y|Y] Y and [Y ], this framework accommodates various types of outcomes (e.g., logistic model for binary outcomes, Poisson model for count outcomes) and different types of SW trials (e.g., cross-sectional and closed-cohort SW trials can be accommodated by different specifcations of random effects and their correlation structures). First, we defne the collection of accumulated observations up to Step hk by Y ( k ) = {Yij( k ) , i = 1,˜ , n; j = 1, ˜ , J}. Note that Yij( k ) = (yij1 , ˜ , yijhk )¢ denotes the observations from the jth subject of the ith cluster up to Step hk . Hence, the fully observed data at the end of the study (T = hK ) is denoted by Y ( K ) . We also defne Y ( -k) such that Y ( K ) = Y ( k ) È Y (-k) and Y ( k ) Ç Y (-k) = Æ. That is, Y (-k) contains the collection of observations to be measured from Steps (hk + 1) to T. Bayesian inference about unknown parameters Y is based on its posterior distribution, obtained by [Y| Y Y] =

ò

Y × [Y ] [Y |Y]

Y × [Y]dY Y [Y|Y]

µ [Y|Y]×[ Y Y ].

Conceptually, the prior distribution [Y ] represents researchers’ opinion about the parameters before observing any data from the trial, likelihood [Y|Y] Y represents information about the parameters contained in observations from this trial, and posterior distribution [Y|Y] Y represents researchers’ updated opinion after the trial. One advantage of the Bayesian approach is that, through the specifcation of prior distributions, it allows researchers to incorporate prior knowledge about treatments into trial design and analysis [36]. On the other hand, when there is no reliable information available at the design stage or inference based on observations from the current study only is desired, non-informative priors can be employed. The hypothesis of interest is H 0 : z £ c vs H1 : z > c, where c is the benchmark intervention effect. At the end of the study, given full observations Y (T ) , conclusions are made based on the posterior probability of z exceeding the pre-specifed threshold c,

190

Design and Analysis of Pragmatic Trials

ò

P(z > c|Y (T ) ) = [z |Y (T ) ]× I (z > c)dz . Here I(×) is an indicator function and [z |Y (T ) ] is the marginal posterior distribution of z , obtained from [Y|Y] Y by integrating out all other parameters. The fnal decision declares the intervention • effective if P(z > c|Y (T ) ) > xU ; • ineffective if P(z > c|Y (T ) ) < x L; • inclusive otherwise. Here xU and x L are decision thresholds, pre-specifed to achieve desired operational characteristics such as power and type I error. In practice, due to model complexity, the posterior distribution [Y|Y] Y usually does not have a closed form. The Markov chain Monte Carlo (MCMC) algorithm [37, 38] has been widely used to generate random samples from [Y Y | Y], based on which posterior means, posterior variances, and various posterior probabilities can be obtained for statistical inference. To incorporate Bayesian adaptive strategy into SW trials, at interim Stage hk , given accumulated observations Y ( k ), we calculate the predictive probability (denoted by PPk ) of declaring the intervention effective at the end of study. The magnitude of PPk serves as decision criteria based on which researchers would stop the trial early due to overwhelming evidence of effcacy/futility or continue the trial to the next stage because the current evidence is not yet conclusive, according to the following interim decision rules: At Step hk , • if PPk < pL , stop the trial and conclude the intervention ineffective; • if PPk > pU , stop the trial and conclude the intervention effective; • otherwise continue to the next interim Stage (hk +1 ) until reaching the end of study. The interim decision thresholds (pL ,pU ) also need to be pre-specifed. In the following, we describe the defnition of PPk . At the end of study, given full data Y (T ) = Y ( k ) È Y (-k) , we defne the decision function of declaring the intervention effective, D(Y ( k ) ,Y (-k ) ) = I{P(z > c|Y (T ) ) > xU }. That is, the intervention is declared effective if D(Y ( k ) ,Y (-k) ) = 1. At interim Stage hk , however, Y (-k) has not been observed yet, which prevents direct evaluation of D(Y ( k ) ,Y (-k) ). The workaround is to make interim decision based on the predictive probability of declaring the intervention effective given interim data Y ( k ), defned by

Mixed-Effect Model Approach and Adaptive Strategies

191

ò

PPk = D(Y ( k ) ,Y (-k) )×[Y (-k) |Y ( k ) ]dY (-k) . Here [Y (-k) |Y ( k ) ] is the predictive distribution of future observations Y ( -k) given Y ( k ). Specifcally,

ò

[Y (-k) |Y ( k ) ] = [Y (-k) |Y Y ] × [Y Y|Y ( k ) ]dY , where [Y ( -k ) |Y] Y is the distribution function of future observations Y (-k ) given parameters Y Y, and [Y|Y Y ( k ) ] is the posterior distribution of model parameters Y given interim observations Y ( k ). Because distribution functions such as [Y (-k) |Y ( k ) ] and [Y|Y Y ( k ) ] usually do not have closed forms, the evaluation of PPk is accomplished through a series of numerical integrations: 1. Based on interim observations Y ( k ), generate L random samples from posterior distribution [Y|Y Y ( k ) ] through MCMC (Markov chain Monte Carlo) simulation, denoted by {Y Y (1) ,˜, Y ( L ) }. The Gibbs sampling approach [39] has been widely employed, which iteratively samples from the full conditional distributions of individual parameters. 2. For l = 1, ˜ , L , plug the posterior sample Y ( l ) obtained from Step 1 into [Y ( -k ) |Y] Y and generate Yˆ (-k ), l from [Y (-k) |Y Y ( l ) ]. This procedure (-k ), l effectively integrates out Y Y, and {Yˆ , l = 1, ˜ , L} are considered (-k) (k ) random samples from [Y |Y ]. 3. Treat Yˆ (T ), l = Y (k) È Yˆ (-k ), l as a “full” dataset. Then PPk is numerically obtained as the average of D(Y ( k ) ,Yˆ (-k),l ) over l = 1, ˜ , L. Importantly, the value of D(Y ( k ) ,Yˆ (-k),l ) is determined by

ò

P(z > c | Yˆ (T ),l ) = I(z > c)P(z | Yˆ (T ),l )dz .

(6.17)

Here P(z | Yˆ (T ),l ) is the margin posterior distribution of z given Yˆ (T ), l , obtained from the joint posterior distribution P(Y Y | Yˆ (T ),l ) . In practice, for every sample Yˆ (-k ), l , the evaluation of (6.17) might require a separate numerical integration based on MCMC simulation from P(Y Y | Yˆ (T ),l ) . The full algorithm to implement a Bayesian adaptive SW trial based on the monitoring of predictive probabilities is described below. It has been common practice to start adaptation after a few steps, so enough data have been collected to avoid premature decisions due to spurious results from a small

192

Design and Analysis of Pragmatic Trials

amount of observations. Without loss of generality, we assume that adaptations are planned at steps hk (k = 1, ˜ , K ) with h1 > 1 and hK = T. The trial proceeds as follows. 1. At Step hk , obtain interim observations Y ( k ). For the lth (l = 1, ˜ , L ) iteration, (a) Generate a sample of model parameters from posterior distribuY ( k ) ], denoted by Y ( k , l ) . tion [Y|Y (b) Generate a sample of future observations from predictive distribution [Y (-k) |Y ( k ) ], denoted by Yˆ (-k ), l . (c) Construct a “full" dataset Yˆ (T ), l = Y (k) È Yˆ (-k),l . Generate Q samples of model parameters from [Y Y | Yˆ (T ),l ] , denoted by ˆ ( k , l ), q , q = 1, ˜,Q . Here we use accent Ù to indicate that the Y

{

(d) (e)

}

samples are generated based on simulated future observations ˆ ( k , l ), q . Yˆ ( -k ), l . Note that zˆ ( k , l ), q is an element of Y Q 1 I zˆ ( k , l ), q > c . Numerically evaluate P z > c|Yˆ (T ), l » q=1 Q Obtain D(Y ( k ) ,Yˆ (-k),l ) = I P z > c|Yˆ (T ),l > xU .

(

)

{(

å ( ) }

)

2. The predictive probability of declaring the intervention effective at the end of study given interim data Y ( k ) is calculated by PPk »

1 L

L

åD(Y

(k )

,Yˆ (-k),l ).

l=1

3. Adaptations are implemented according to decision rules: (a) If PPk < pL , stop the trial early and conclude the intervention ineffective; (b) If PPk > pU , stop the trial early and conclude the intervention effective; (c) Otherwise set k = k + 1. If k £ K , go to #1. If k > K , continue the trial to the end of study. 4. Obtain full observations Y (T ) . (a) (b) (c)

Y (T ) ], Generate Q samples of z from posterior distribution [Y|Y denoted by {z q , q = 1, ˜ , Q}. Q 1 Numerically evaluate P z > c|Y (T ) » I zq >c . q=1 Q If P z > c|Y (T ) > xU , declare the intervention effective; If P z > c|Y (T ) < x L , declare the intervention ineffective; Otherwise declare the trial inconclusive.

(

(

)

)

(

)

å (

)

Mixed-Effect Model Approach and Adaptive Strategies

193

Design parameters related to Bayesian adaptation include treatment effect benchmark c, interim decision thresholds p L and p U , fnal decision thresholds x L and xU , as well as the choice of interim steps hk , k = 1, ˜ , K . In practice, the values of these parameters are specifed through numerical search to achieve desired operational characteristics. Some researchers have recommended that Bayesian methods should be calibrated to have good frequentist properties [40, 41]. Following this spirit and adhering to regulatory practice, FDA guideline [42] has recommended that frequentist operational characteristics such as type I error and power be assessed for Bayesian trial design. Other frequently reported operational metrics include the probabilities of early stop (due to effcacy and/or futility), expected length of trial, and expected sample size. In the above, we have constructed Bayesian adaptive strategy based on predictive probabilities. An alternative approach is to directly utilize posterior probability P(z > c|Y (t ) ) , which has been explored in literature [33, 43]. The adaptive decisions investigated only involve whether to stop the trial early for overwhelming effcacy or futility, which is in effect equivalent to the group sequential design [23]. Within the context of SW trials, adaptive strategies can also include the adjustment of crossover probabilities (for example, according to the currently estimated treatment effect), or the adjustment of future panel sizes in cross-sectional SW trials. It has been demonstrated that the group sequential adaptive strategy has the potential to substantially reduce the sample size and length of SW trials when the null or alternative hypothesis is true [30]. One reservation about incorporating adaptive strategies into SW trials is that early stopping of SW trials might be unlikely because researchers conducting SW trials often hold the prior belief that the intervention is effective [44]. A recent literature review [45], however, reported that 31% of SW trials did not fnd a signifcant effect of the experimental intervention on any of the primary outcomes. This observation offers justifcation for the investigation of adaptive strategies for SW trials.

6.4 Further Readings We have reviewed various design and analysis issues for the application of SW trials in pragmatic clinical studies. Both GEE and mixed-effect model approaches are presented. We have presented closed-form sample size solutions that are fexible to account for design issues frequently encountered by biostatisticians, including unbalanced randomization, missing data, complicated correlation structures, and randomly varying cluster sizes. We further present frequentist and Bayesian strategies to incorporate adaptive decisions into SW trials, which can potentially shorten trial duration and

194

Design and Analysis of Pragmatic Trials

improve patient safety. We demonstrate that through different specifcations of the treatment sequence vectors assigned to clusters, the presented theoretical framework readily accommodates other types of multi-period cluster randomized trials, including longitudinal cluster randomized trial (LCRTs) and crossover cluster randomized trials (CCRTs). Finally, we have reviewed correlation structures that have been frequently assumed for observations from SW trials. They can be specifed through random effects in mixedeffect models, but also can be directly specifed as the marginal distribution under GEE. Currently one under-developed research area is design methods for opencohort SW trials. A systematic review [46] reported that out of 37 SW trials conducted between 2010 and 2104, 11 (30%) are open-cohort. Another systematic review of SW trials over the period of 2012 to 2017 reported a proportion of 15.2% (7 out of 46). Despite its wide application, methodological development for the design of open-cohort SW trials has been relatively scarce. Kasza et al. [47] proposed a sample size calculation approach for open-cohort SW trials based on mixed-effect models. The basic model takes the same form as (6.1), Yijt = lt + uitz + t i + q ij + hit + eijt . It envisions a trial scenario where participants can fow into and out of clusters at each step, but the total number of participants from each cluster (denoted by J) that contribute measurements remains stable over the steps. Such a scenario can arise from different sampling schemes. For example, the “core scheme” involves a core group in each cluster who provide measurements through all steps, complemented by another group that each only participates in one step of the trial. The “closed population” scheme involves repeated sampling from a closed population. It can be shown that the clusJ 1 Yijt are suffcient statistics for the treatment effect ter-step means Yit = j=1 J z and follow model

å

Yit = lt + uitz + t i + q i* + hit + ei*t . Here

q i* =

1 J

å

J

q ij ~ N(0, s q2 / J)

j=1

and

eit* =

1 J

å

J

eijt ~ N(0, s e2 / J).

j=1

Furthermore, the covariance between cluster-step means is Cov(Yit , Yit¢ ) = s t2 + s q2

ni (t , t¢) , J2

where ni (t , t¢) is the number of participants in Cluster i who provide measurements in both Steps t and t¢, ni (t , t¢) = ni (t¢, t) £ J and ni (t , t) = J . Assuming

Mixed-Effect Model Approach and Adaptive Strategies

195

a constant number of overlapping participants, ni (t , t¢) = J * across clusters and steps, the variance of the estimated treatment effect can be analytically derived. Li et al. [2] presented an expression which has a form similar to (6.8),

Var(zˆ ) =

(s 2 / J )nT[k 1 + f ( ra - rb )][k 2 - f (T - 1)( ra - rb )] . (U 2 + nTU - TW - nV)[k 2 - f (T - 1)( ra - rb )] - (U 2 - nV)[k 1 + f ( ra - rb )] (6.18)

Note that the defnitions of s 2 , k1 , k 2 , ra , rb , and rw are the same as in Section 6.2.1.3, design constants U , V , and W were defned in (6.8), and f = 1 - J * / J is the common rate of attrition or churn rate. The derivation of Var(zˆ ) leads to a closed-form sample size formula. Importantly, as f ® 1, the open-cohort design approaches the cross-sectional design and (6.18) approaches the variance derived under the nested exchangeable correlation structure, while as f ® 0 , the open-cohort design approaches the closed-cohort design, and (6.18) approaches the variance derived under the block exchangeable correlation structure. Kasza et al. [47] also introduced an extension that allows correlations to decay in open-cohort SW trials. Further developments are needed for the design of open-cohort trials with non-constant churn rates (for example, increasing with |t¢ - t|), random varying cluster sizes, and other types of outcomes (such as binary and count). In designing SW trials, the most commonly considered scheme is called “uniform allocation”, where an equal number of clusters switch from control to intervention. It is not obvious that uniform allocation is the best strategy. The effciency of SW trials and their optimal design have been investigated by several researchers. Lawrie et al. [48] explored optimal design for crosssectional SW trials based on a mixed-effect model, Yijt = lt + uitz + t i + eijt , which is the same as (6.7). It only considers cluster random effects t i . The variance of the estimated treatment effect Var(zˆ ) has a closed form (6.8). Here for convenience, we rewrite (6.8) as Var(zˆ ) =

ns 2 (1 + r (JT - 1)) . (1 + r (JT - 1))(nU - W) + J r (U 2 - nV)

Recall that s 2 = s t2 + s e2 , r = s t2 / s 2 , and U , V ,W are various design constants calculated based on uit’s. Defne pt (t = 2,˜ , T ) to be the proportion of clusters that are randomized to switch to intervention at Step t. The basic idea of fnding the optimal allocation scheme is to fnd a particular p = (p2 , ˜ , pT )¢ that minimizes Var(zˆ ), given design confguration (n, J , T , s 2 , r ). Note that

196

Design and Analysis of Pragmatic Trials

p affects Var(zˆ ) through uit ’s, which subsequently determine the values of (U , V, W ) . Lawrie et al. [48] showed that optimality is achieved only at p2opt = pTopt = ptopt =

1 + r (3J - 1) , 2[1 + r (JT - 1)]

Jr , for t = 3,˜, T - 1. 1 + r (JT - 1)

It is obvious that p2opt and pTopt are larger than ptopt (t = 3, ˜ , T - 1), hence the “uniform allocation” is not optimal for T > 3. Under T = 3, it can be shown 1 that p2opt = p3opt = , which is uniform allocation. They further showed that the 2 effciency of uniform allocation can be poor for small values of r , particularly when T is large. Based on Model (6.1), Girling and Hemming [15] attempted to fnd the optimal design in a larger space (called a hybrid design), which allows some clusters to be randomized to a parallel clustered trial and the remaining clusters to an SW trial. They presented a formula for design effect that applies to both cross-sectional and closed-cohort studies, and found that the optimal proportion of clusters to randomize to the SW trial equals to the cluster-mean correlation. Thompson et al. [49], on the other hand, used a different design effect to fnd the optimal combination of number of sequences and proportion of observations before and after treatment switchover. In cross-sectional SW trials, two levels of clustering (subjects within clusters) have been considered. It is likely, however, that more than two levels of clustering are involved in SW trials. For example, the CHANGE trial (ClinicalTrial.gov NCT02817282) randomizes nursing homes (level 4) consisting of nursing home wards (level 3) in which nurses (level 2) are observed whether they comply with hand hygiene guidelines over multiple sessions (level 1). Depending on which levels are repeatedly involved, there can be different scenarios. (a) Level 4 repeated: the same nursing homes are involved all through, but at each step different wards (and consequently, different nurses and sessions) are involved, implying cross-section measurements at lower levels (1, 2, and 3). (b) Levels 4 and 3 repeated: the same wards from the same nursing homes are involved all through, but at each step different nurses are enrolled, implying cross-section measurements at levels 1 and 2. (c) Levels 4, 3, and 2 repeated: the same nurses (hence the same wards and nursing homes) are repeatedly measured over sessions, implying cross-sectional measurements at level 1. Randomization occurs at the highest level, which is always repeatedly involved. To calculate the sample size requirement for SW trials with > 2 levels of clustering, Teerenstra et al. [50] employed a mixed-effect model that extends (6.7) by incorporating random effects for each clustering level. They assumed that each level-h unit contains the same number of level-(h - 1) units (h = 2, ˜ , H ), where H is the

197

Mixed-Effect Model Approach and Adaptive Strategies

total number of levels. Defne Yit to be the cluster-step mean for Cluster i at Step t. The model implies equal variance, denoted by Var(Yit ) = f 2 , and equal covariance, denoted by Cov(Yit , Yit¢ ) = w 2 , t ¹ t¢ . The basic idea of Teerenstra et al. [50] is to derive f 2 and w 2 for particular scenarios, and the variance of estimated treatment effect can be obtained by a general formula developed by Hussey and Hughes [10], Var(zˆ ) =

å å

nf 2 (f 2 + Tw 2 ) , (nU - W)f + (U 2 + nTU - TW - nV )w 2 2

å å

2

å å

2

n n T æ æ uit ö÷ , and V = uit ÷ö . The ç ç i=1 t=1 t=1 è i=1 i=1 t=1 ø è ø clustering structure (number of levels and size at each level) affects f 2 , and which levels are involved repeatedly affects w 2 . Teerenstra et al. [50] further demonstrated that a variance infation factor can be obtained for sample size calculation. It can be presented as the product of two variance infation terms: one due to multi-level clustering and the other due to the treatment switchover scheme under an SW trial.

with U =

n

T

uit , W =

T

References 1. C.M. Sitlani, P.J. Heagerty, E.A. Blood, and T.D. Tosteson. Longitudinal structural mixed models for the analysis of surgical trials with noncompliance. Statistics in Medicine, 31(16):1738–1760, 2012. 2. F. Li, J.P. Hughes, K. Hemming, M. Taljaard, E.R. Melnick, and P.J. Heagerty. Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: An overview. Statistical Methods in Medical Research, 30(2):612– 639, 2021. 3. H.D. Patterson, and R. Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545–554, 1971. 4. D.A. Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358):320–338, 1977. 5. J.T. Martin, K. Hemming, and A. Girling. The impact of varying cluster size in cross-sectional stepped-wedge cluster randomised trials. BMC Medical Research Methodology, 19(1):1–11, 2019. 6. C.A. Kristunas, K.L. Smith, and L.J. Gray. An imbalance in cluster sizes does not lead to notable loss of power in cross-sectional, stepped-wedge cluster randomised trials with a continuous outcome. Trials, 18(1):1–11, 2017. 7. R. Hooper, S. Teerenstra, E. de Hoop, and S. Eldridge. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35(26):4718–4728, 2016.

198

Design and Analysis of Pragmatic Trials

8. S. Teerenstra, S. Eldridge, M. Graff, E. de Hoop, and G.F. Borm. A simple sample size formula for analysis of covariance in cluster randomized trials. Statistics in Medicine, 31(20):2169–2178, 2012. 9. R. Hooper, and L. Bourke. Cluster randomised trials with repeated cross sections: Alternatives to parallel group designs. BMJ: British Medical Journal, 350, 2015. 10. M.A. Hussey, and J.P. Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28(2):182–191, 2007. 11. W. Woertman, E. de Hoop, M. Moerbeek, S.U. Zuidema, D.L. Gerritsen, and S. Teerenstra. Stepped wedge designs could reduce the required sample size in cluster randomized trials. Journal of Clinical Epidemiology, 66(7):752–758, 2013. 12. K. Hemming, and A. Girling. A menu-driven facility for power and detectabledifference calculations in stepped-wedge cluster-randomized trials. The STATA Journal: Promoting Communications on Statistics and Stata, 14(2):363–380, 2014. 13. K. Hemming, R. Lilford, and A.J. Girling. Stepped-wedge cluster randomised controlled trials: A generic framework including parallel and multiple-level designs. Statistics in Medicine, 34(2):181–196, 2015. 14. G. Baio, A. Copas, G. Ambler, J. Hargreaves, E. Beard, and R.Z. Omar. Sample size calculation for a stepped wedge trial. Trials, 16(1):1–15, 2015. 15. A.J. Girling, and K. Hemming. Statistical effciency and optimal design for stepped cluster studies under linear mixed effects models. Statistics in Medicine, 35(13):2149–2166, 2016. 16. F. Li, E.L. Turner, and J.S. Preisser. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics, 74(4):1450–1458, 2018. 17. J. Kasza, K. Hemming, R. Hooper, J.N.S. Matthews, and A.B. Forbes. Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials. Statistical Methods in Medical Research, 28(3):703–716, 2019. 18. K.L. Grantham, J. Kasza, S. Heritier, K. Hemming, and A.B. Forbes. Accounting for a decaying correlation structure in cluster randomized trials with continuous recruitment. Statistics in Medicine, 38(11):1918–1934, 2019. 19. A. Liu, W.J. Shih, and E. Gehan. Sample size and power determination for clustered repeated measurements. Statistics in Medicine, 21(12):1787–1801, 2002. 20. M. Lefkopoulou, D. Moore, and L. Ryan. The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. Journal of the American Statistical Association, 84(407):810–815, 1989. 21. F. Li. Design and analysis considerations for cohort stepped wedge cluster randomized trials with a decay correlation structure. Statistics in Medicine, 39(4):438–455, 2020. 22. S.-C. Chow, M. Chang, and A. Pong. Statistical consideration of adaptive methods in clinical development. Journal of Biopharmaceutical Statistics, 15(4):575–591, 2005. 23. K.K.G. Lan, and D.L. DeMets. Group sequential procedures: Calendar versus information time. Statistics in Medicine, 8(10):1191–1198, 1989. 24. S.K. Wang, and A.A. Tsiatis. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics, 43(1):193–199, 1987. 25. Q. Liu, M.A. Proschan, and G.W. Pledger. A unifed theory of two-stage adaptive designs. Journal of the American Statistical Association, 97(460):1034–1041, 2002.

Mixed-Effect Model Approach and Adaptive Strategies

199

26. C. Jennison, and B.W. Turnbull. Group Sequential Methods with Applications to Clinical Trials. CRC Press, 1999. 27. S.J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2):191–199, 1977. 28. P.C. O’Brien, and T.R. Fleming. A multiple testing procedure for clinical trials. Biometrics, 35(3):549–556, 1979. 29. K.K.G. Lan, and D.L. DeMets. Discrete sequential boundaries for clinical trials. Biometrika, 70(3):659–663, 1983. 30. M.J. Grayling, J.M.S. Wason, and A.P. Mander. Group sequential designs for stepped-wedge cluster randomised trials. Clinical Trials, 14(5):507–517, 2017. 31. J. Whitehead, E. Valdés-Márquez, and A. Lissmats. A simple two-stage design for quantitative responses with application to a study in diabetic neuropathic pain. Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry, 8(2):125–135, 2009. 32. J.J. Lee, and D.D. Liu. A predictive probability design for phase II cancer clinical trials. Clinical Trials, 5(2):93–106, 2008. 33. P.F. Thall, and R. Simon. A Bayesian approach to establishang sample size and monitoring criteria for phase II clinical trials. Controlled Clinical Trials, 15(6):463– 481, 1994. 34. G. Yin, N. Chen, and J.J. Lee. Phase II trial design with Bayesian adaptive randomization and predictive probability. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(2):219–235, 2012. 35. Z. Chen, Y. Zhao, Y. Cui, and J. Kowalski. Methodology and application of adaptive and sequential approaches in contemporary clinical trials. Journal of Probability and Statistics, 2012, 2012. 36. Z. Zhang, F. Hamagami, L.L. Wang, J.R. Nesselroade, and K.J. Grimm. Bayesian analysis of longitudinal data using growth curve models. International Journal of Behavioral Development, 31(4):374–383, 2007. 37. A.E. Gelfand, and A.F.M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990. 38. B.A. Berg. Markov Chain Monte Carlo Simulations and Their Statistical Analysis: With Web-Based Fortran Code. World Scientifc Publishing Company, 2004. 39. S. Geman, and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Martin A. Fischler and Oscar Firschein (eds.) Readings in Computer Vision (pp. 564–584). Elsevier, 1987. 40. G.E.P. Box. Sampling and Bayes’ inference in scientifc modelling and robustness. Journal of the Royal Statistical Society: Series A (General), 143(4):383–404, 1980. 41. D.B. Rubin. Bayesianly justifable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172, 1984. 42. FDA. Guidance for the use of Bayesian statistics in medical device clinical trials, 2010. https://www.fda.gov/regulatory-information/search-fda-guidance -documents/guidance-use-bayesian-statistics-medical-device-clinical-trials#4. 43. S.-B. Tan, and D. Machin. Bayesian two-stage designs for phase II clinical trials. Statistics in Medicine, 21(14):1991–2012, 2002. 44. E. de Hoop, I. van der Tweel, R. van der Graaf, K.G.M. Moons, J.J.M. van Delden, J.B. Reitsma, et al. The need to balance merits and limitations from different disciplines when considering the stepped wedge cluster randomized trial design. BMC Medical Research Methodology, 15(1):1–13, 2015.

200

Design and Analysis of Pragmatic Trials

45. M.J. Grayling, J. Wason, and A.P. Mander. Stepped wedge cluster randomized controlled trial designs: A review of reporting quality and design features. Trials, 18(1):1–13, 2017. 46. E. Beard, J.J. Lewis, A. Copas, C. Davey, D. Osrin, G. Baio, et al. Stepped wedge randomised controlled trials: Systematic review of studies published between 2010 and 2014. Trials, 16(1):1–14, 2015. 47. J. Kasza, R. Hooper, A. Copas, and A.B. Forbes. Sample size and power calculations for open cohort longitudinal cluster randomized trials. Statistics in Medicine, 39(13):1871–1883, 2020. 48. J. Lawrie, J.B. Carlin, and A.B. Forbes. Optimal stepped wedge designs. Statistics and Probability Letters, 99:210–214, 2015. 49. J.A. Thompson, K. Fielding, J. Hargreaves, and A. Copas. The optimal design of stepped wedge trials with equal allocation to sequences and a comparison to other trial designs. Clinical Trials, 14(6):639–647, 2017. 50. S. Teerenstra, M. Taljaard, A. Haenen, A. Huis, F. Atsma, L. Rodwell, et al. Sample size calculation for stepped-wedge cluster-randomized trials with more than two levels of clustering. Clinical Trials, 16(3):225–236, 2019.

Index A Acute Coronary Syndrome Quality Improvement in Kerala (ACS QUIK) trial, 23 Adaptive strategies, stepped-wedge (SW) trial design, 185 Bayesian adaptive design, 188–193 group sequential design, 185–188 readings, 193–197 Adjusted chi-square test, 45–47 Adjusted normality test, 53–55 Adjusted two-sample t-test, 35–38 Analysis of variance (ANOVA), 34, 45 AR(1) structure, 121 B Bayesian adaptive design adaptations, 191–192 adaptive strategy, 193 distribution functions, 191 example, 188–189 hypothesis of interest, 189 interim Stage, 190 Markov chain Monte Carlo (MCMC) algorithm, 190 numerical integrations, 191 observations, 189–190 predictive probabilities (PP), 188, 193 trial proceeds, 192 Binary outcomes, 43–44 cluster randomized trial adjusted chi-square test, 45–47 generalized estimating equation approach, 48–51 generalized linear mixed model approach, 51–52 ratio estimator chi-square test, 47–48 standard Pearson chi-square test, 44 stepped-wedge trials, 142–145 derivation of equation, 166–167

extension outcomes, exponential family, 145–148 stratifed cluster randomized design estimation of clustering parameter, 109–111 example, 111–112 relative sample size change, varying cluster size, 109 sample size estimation, CMH statistic, 105–109 Block exchangeable correlation structure, 182–183 C Cancer risk intake system (CRIS) trial, 9 CCRTs, see Crossover cluster randomized trials CHAT, see Community Hypertension Assessment Trial Closed cohort SW trials, 120–121 Cluster-level analysis, 33, 43 Cluster randomization, 7 Cluster randomized trial (CRT), 4–8, 31–32 binary outcomes, 43–44 adjusted chi-square test, 45–47 generalized estimating equation approach, 48–51 generalized linear mixed model approach, 51–52 ratio estimator chi-square test, 47–48 standard Pearson chi-square test, 44 cluster size determination, fxed number of clusters, 59–60 completely randomized cluster trial design, 8–10 continuous outcomes, 32–34 adjusted two-sample t-test, 35–38 generalized estimating equation method, 39–41

201

202

mixed-effects linear regression models, 41–43 standard two-sample t-test, 35 count outcomes, 52–53 adjusted normality test, 53–55 generalized estimating equation, 56–58 ratio estimator method, 55–56 multiple-period cluster randomized designs, 17–18 crossover cluster randomized trials, 19, 20–21 longitudinal cluster randomized design, 18–20 stepped-wedge cluster randomized trials, 21–23 restricted randomized designs, 10–11 covariate-constrained randomized design, 16–17 matched-pair cluster randomized design, 11–14 stratifed cluster randomized design, 14–16 Cluster size, 97–98 Cluster size determination, 59–60 Cochran–Mantel–Haenszel (CMH) statistics, 98, 105–106 Community Hypertension Assessment Trial (CHAT), 86, 88 Continuous outcomes, 32–34 cluster randomized trial adjusted two-sample t-test, 35–38 generalized estimating equation method, 39–41 mixed-effects linear regression models, 41–43 standard two-sample t-test, 35 stepped-wedge trials, 125–132 accounting missing data, 132–134 adjusting underestimated variances, small sample sizes, 137–138 consideration of, effciency and robustness, 138–141 derivation of equations, 163–165 simulation research, 134–137 stratifed cluster randomized design example, 103–105 relative sample size change, varying cluster size, 103

Index

sample size estimation, GEE approach, 100–103 Correction factor, 45 Correlation structure, 179 block exchangeable correlation structure, 182–183 exchangeable correlation structure, 179–180 exponential decay correlation structure, 183–184 nested exchangeable correlation structure, 180–182 proportional decay correlation structure, 184–185 Count outcomes, CRT, 52–53 adjusted normality test, 53–55 generalized estimating equation, 56–58 ratio estimator method, 55–56 Covariate-constrained randomization, 16–17, 90 CRIS, see Cancer risk intake system trial Crossover cluster randomized trials (CCRTs), 18–21, 149–151, 154–156 Cross sectional SW trials, 120–121 CRT, see Cluster randomized trial D Design effect (DE), 8, 32, 48 E Effective sample size, 48 EHR, see Electronic health record Eigenvalues, correlation matrix, 91 Electronic health record (EHR), 2, 112 Exchangeable correlation structure, 179–180 Exponential decay correlation structure, 183–184 G Generalized estimating equation (GEE) approach, 9, 12, 15, 39–41, 56–58, 66, 89, 98, 121 vs. crude adjustment, relative effciency

203

Index

with missing binary outcomes, 83–84 with missing continuous outcomes, 72–74 sample size estimation with missing binary outcomes, 80–83 with missing continuous outcomes, 68–72 Generalized estimating equation (GEE) approach, stepped-wedge (SW) trials design advantages of, 119–120 Bs and relevant terms, distributions in exponential family, 148 derivation of equation, 161–163 design SW trials, binary outcome, 142–145 derivation of equation, 166–167 extension outcomes, exponential family, 145–148 design SW trials, continuous outcome, 125–132 accounting missing data, 132–134 adjusting underestimated variances, small sample sizes, 137–138 consideration of, effciency and robustness, 138–141 derivation of equations, 163–165 simulation research, 134–137 diagrams of LCRT and CCRTm two sequences, 149 healthcare system, 119 longitudinal and crossover cluster randomized trials, 148–151 accounting randomly varying cluster sizes, 158–161 adjusting small numbers of clusters, t -Distribution, 157–158 crossover cluster randomized trials, 154–156 longitudinal cluster randomized trials, 152–154 nLCRT vs. nCCRT, 156–157 overview, 119–122 required number of clusters (empirical power, empirical type I error)

in CCRTs, with and without adjustment by t-distribution, 159–160 under normality assumption, 139 normality assumption is violated, 141 review of GEE, 122–125 sample size for cross-sectional SW trials, 165 types of, 120–121 Generalized linear mixed model (GLMM) approach, 15, 26, 51–52, 66 GoodNEWS, 59 Group sequential design, 185–188 H Head Positioning in Acute Stroke Trial (HeadPoST), 21 Healthcare system, 119 Hooper’s approach, 138–140 I ICC, see Intracluster correlation coeffcient Individualized randomized trial (IRT), 6, 7, 10, 32 Individual-level analysis, 33 Infated type I error with missing binary outcomes, 84–86 with missing continuous outcomes, 74–76 Information technology (IT), 103–105 Intraclass correlations, 66 Intracluster correlation, 41 Intracluster correlation coeffcient (ICC), 5, 7, 8, 16, 31, 32, 34, 97–99 IRT, see Individualized randomized trial IT, see Information technology L LCRT, see Longitudinal cluster randomized trial Liang’s chi-square test, 13 Longitudinal and crossover cluster randomized trials, 148–151

204

accounting randomly varying cluster sizes, 158–161 adjusting small numbers of clusters, t -Distribution, 157–158 crossover cluster randomized trials, 154–156 longitudinal cluster randomized trials, 152–154 nLCRT vs. nCCRT, 156–157 Longitudinal cluster randomized trial (LCRT), 18–20, 148–154 M Mantel–Haenszel chi-square test, 13, 15 MAR, see Missing at random Matched-pair cluster randomized design, 11–14, 65–66 correlation impact, 67 with missing binary outcomes example, 86–88 infated type I error adjustment, 84–86 relative effciency, GEE approach vs. crude adjustment, 83–84 sample size estimation based on GEE approach, 80–83 with missing continuous outcomes example, 78–80 infated type I error adjustment, 74–76 relative effciency, GEE approach vs. crude adjustment, 72–74 sample size estimation based on GEE approach, 68–72 sensitivity analysis, 77–78 missing data impact, 67 Matched-pair randomized design, 97 MCAR, see Missing completely at random Mexican Universal Health Insurance Program, 65 Missing at random (MAR), 4 Missing binary outcomes, matched-pair cluster randomized design example, 86–88 infated type I error adjustment, 84–86 relative effciency, GEE approach vs. crude adjustment, 83–84

Index

sample size estimation based on GEE approach, 80–83 Missing completely at random (MCAR), 4, 70, 89 Missing continuous outcomes, matchedpair cluster randomized design example, 78–80 infated type I error adjustment, 74–76 relative effciency, GEE approach vs. crude adjustment, 72–74 sample size estimation based on GEE approach, 68–72 sensitivity analysis, 77–78 Missing not at random (MNAR), 4 Misspecifed correlation structure, 77, 79 Misspecifed missing pattern, 77, 78 Mixed-effect model (MM), 9, 121 Mixed-effect model approach, steppedwedge (SW) trial design readings, 193–197 review of, 173–174 sample size calculation, cluster-step means, 174–179 block exchangeable correlation structure, 182–183 exchangeable correlation structure, 179–180 exponential decay correlation structure, 183–184 nested exchangeable correlation structure, 180–182 proportional decay correlation structure, 184–185 Mixed-effects linear regression models, 41–43 MNAR, see Missing not at random Morel, Bokossa, and Neerchal (MBN), 137 Multi-arm pragmatic trials, 114–115 Multiple-period cluster randomized designs, 17–18 crossover cluster randomized trials, 19, 20–21 longitudinal cluster randomized design, 18–20 stepped-wedge cluster randomized trials, 21–23

205

Index

N Nested exchangeable correlation structure, 180–182 nLCRT vs. nCCRT, 156 Theorem 1, 156–157 proof of, 167–168 Theorem 2, 157 proof of, 168–169 Non-Bayesian approach, 52 O Open-cohort SW trials, 120–121 Opt-out process, 5 P PCP, see Primary care provider Pragmatic clinical trials (PCTs), 1, 3, 6, 119 Pragmatic-Explanatory Continuum Indicator Summary (PRECIS) tool, 2 Pragmatic-Explanatory Continuum Indicator Summary version 2 (PRECIS-2), 2 Pragmatic randomized trials, 1–3 cluster randomized trial, 6–8 completely randomized cluster trial design, 8–10 multiple-period cluster randomized designs, 17–23 restricted randomized designs, 10–17 statistical issues, 3–6 PRECIS, see Pragmatic-Explanatory Continuum Indicator Summary tool PRECIS-2, see Pragmatic-Explanatory Continuum Indicator Summary version 2 Pre–post cluster experimental design, 90 Primary care providers (PCPs), 20, 38 Proportional decay correlation structure, 184–185 R Randomized clinical trials (RCTs), 1, 4 Ratio estimator chi-square test, 47–48

Ratio estimator method, 55–56 RCTs, see Randomized clinical trials Relative effciency GEE approach vs. crude adjustment with missing binary outcomes, 83–84 with missing continuous outcomes, 72–74 Reminder–recall (R/R) immunization trial, 17 Restricted randomized designs, 10–11 covariate-constrained randomized design, 16–17 matched-pair cluster randomized design, 11–14 stratifed cluster randomized design, 14–16 S Sample size estimation, GEE approach with missing binary outcomes, 80–83 with missing continuous outcomes, 68–72 Sample size ratio, 73, 84 Sensitivity analysis, 77–78 Size-stratifed cluster randomized design, 97–98 Standard Pearson chi-square test, 44 Standard two-sample t-test, 35 Stepped-wedge cluster randomized trials (SW-CRTs), 18, 21–23 Stratifed cluster randomized design, 14–16, 97–98 binary outcomes, 105 estimation of clustering parameter, 109–111 example, 111–112 relative sample size change, varying cluster size, 109 sample size estimation, CMH statistic, 105–109 considerations, 98–100 continuous outcomes, 100 example, 103–105 relative sample size change, varying cluster size, 103 sample size estimation, GEE approach, 100–103

206

Index

overview, 97–98 synopsis, 112–115 Stratifed randomized design, 97 Stratum-specifc effects, 15 SW-CRTs, see Stepped-wedge cluster randomized trials

V

U

W

Uniform allocation, 196

Wilcoxon signed-rank test, 12

Variance infation factor (VIF), 7, 32 Variance-weighted cluster-level analysis, 9 VIF, see Variance infation factor