### Review article

# Planning and Analysis of Trials Using a Stepped Wedge Design: Part 26 of a Series on Evaluation of Scientific Publications



__Background__: The stepped-wedge design (SWD) of clinical trials has become very popular in recent years, particularly in health services research. Typically, study participants are randomly allotted in clusters to the different treatment options.

__Methods__: The basic principles of the stepped wedge design and related statistical techniques are described here on the basis of pertinent publications retrieved by a selective search in PubMed and in the CIS statistical literature database.

__Results__: In a typical SWD trial, the intervention is begun at a time point that varies from cluster to cluster. Until this time point is reached, all participants in the cluster belong to the control arm of the trial. Once the intervention is begun, it is continued without change until the end of the trial period. The starting time for the intervention in each cluster is determined by randomization. At the first time point of measurement, no intervention has yet begun in any cluster; at the last one, the intervention is in progress in all clusters. The treatment effect can be optimally assessed under the assumption of an identical correlation at all time points. A method is available to calculate the power and the number of clusters that would be necessary in order to achieve statistical significance by the appropriate type of significance test. All of the statistical techniques presented here are based on the assumptions of a normal distribution of cluster means and of a constant intervention effect across all time points of measurement.

__Conclusion__: The necessary statistical tools for the planning and evaluation of SWD trials now stand at our disposal. Such trials nevertheless are subject to major risks, as valid results can be obtained only if the far-reaching assumptions of the model are, in fact, justified.

The value of the principle of randomization to compare treatments and interventions remains undisputed in medical research, and randomized controlled trials (RCTs) are the acknowledged gold standard. Due to practical considerations, a number of variations have been developed in addition to the traditional RCT design, including cluster-randomized trials and the stepped-wedge design (SWD). In cluster-randomized, parallel-group trials—the prevailing type of cluster-randomized trial—groups of individuals (e.g. doctors’ practices, school classes, regions), rather than individuals themselves, are randomized to receive the intervention. These groups are generally—as well as in the rest of this article—referred to as clusters.

## Basic principle, model assumptions, estimator of treatment effect

In SWD trials, all individuals or clusters are observed first for a certain period of time under control conditions and then under intervention conditions until the end of the trial. Randomization is used to decide when the transition to the intervention is made. The number of consecutive points in time at which the outcome variable is observed is identical for all clusters, except for cases with missing values. Individuals may either be treated only once (cross-sectional SWD) or switch from control treatment to the intervention during the trial (open- versus closed-cohort SWD). In principle, the unit of observation in an SWD may be either an individual or a cluster. In practice, however, SWD trials are usually conducted as an alternative to cluster-randomized trials.

In recent years, SWD trials have gained considerable popularity for planning scientific studies in medicine and health care research. This is reflected in the volume of medical literature on SWD trials: for example, a PubMed search using the keywords “stepped wedge” for a systematic review of the literature, covering publications from 2010 through 2014, yielded a total of 491 hits (1) (as of June 8, 2018). Among the health care research projects funded by the Innovation Fund of Germany’s Federal Joint Committee (G-BA, Gemeinsamer Bundesausschuss) since 2015, there are several trials following an SWD.

SWD trials were described in the literature on experimental design as early as the late 1970s (2). The first large-scale study conducted and termed an SWD trial dates from 1987 (3). In the course of that project, a large-scale vaccination program was implemented in Gambia, for which 17 teams were formed. All the teams initially started a standard vaccination program. Hepatitis vaccination was adopted gradually, by one team at a time. The aim was to have vaccinated all children against hepatitis B viruses (HBVs) after approximately 4 years. The main reason given for proceeding in this way was logistics, including vaccine availability. The outcome was evaluated in terms of the incidence of liver tumors. Indirect evidence that vaccination effectively reduced HBV infection had already been confirmed before in a number of studies in high-risk groups. It was also known that HBV infection was a risk factor for liver cancer. According to the authors of the trial, it would be valuable to obtain direct evidence that vaccination reduced the incidence of liver tumors. With respect to this trial, there was also debate as to whether a 4-year traditional parallel-group trial should be conducted instead of the SWD. However, there were many organizational arguments against this, so the SWD design was chosen.

SWD trials are often also referred to as unidirectional crossover trials (4). This can be explained by the schedule shown in Table 1 for clusters’ transition from the control arm to the intervention arm of the trial for the standard case of a 2-armed SWD trial: each cluster begins in the control arm (C). The transition to the intervention treatment (I) occurs at the latest at the last follow-up time. The only possible combinations for 2 consecutive points in time are therefore C-C, C-I, and I-I, whereas I-C is impossible. Unlike in true, bidirectional crossover trials (5), there are thus no observational units for which the outcome variable is measured without intervention after the end of the trial’s intervention phase. Which cluster is allocated to which row of the scheme is determined by randomization. Table 1 shows a specific example, and one recognizes the stepped-wedge shape between control and intervention periods that gives SWD trials their name. The number of clusters per start time need not be restricted to one, but it should remain constant over time where possible.
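The schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `swd_schedule` and `allocate_clusters` are hypothetical, the design has one more measurement time point than intervention start times, and the same number of clusters switches at every start time.

```python
import random


def swd_schedule(n_steps, clusters_per_step=1):
    """Return one row per cluster; each row holds 'C' or 'I' for every time point.

    Assumes n_steps intervention start times and n_steps + 1 measurement
    time points, so the first column is all 'C' and the last is all 'I'.
    """
    rows = []
    for step in range(1, n_steps + 1):
        # A cluster that starts at `step` is under control before that time point.
        row = ["C"] * step + ["I"] * (n_steps + 1 - step)
        rows.extend(list(row) for _ in range(clusters_per_step))
    return rows


def allocate_clusters(cluster_ids, n_steps, seed=None):
    """Randomly assign clusters to start times 1..n_steps in equal numbers."""
    ids = list(cluster_ids)
    if len(ids) % n_steps != 0:
        raise ValueError("number of clusters must be a multiple of n_steps")
    random.Random(seed).shuffle(ids)
    per_step = len(ids) // n_steps
    # After shuffling, consecutive blocks of clusters share a start time.
    return {cid: 1 + i // per_step for i, cid in enumerate(ids)}


if __name__ == "__main__":
    for row in swd_schedule(4):
        print(" ".join(row))
    print(allocate_clusters(["A", "B", "C", "D"], n_steps=4, seed=2019))
```

Each printed row contains a run of C followed by a run of I, so the sequence I-C never occurs, which is the unidirectional property discussed above.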

**Table 1**

SWD trials are preferred over parallel-group or (true) crossover trials if it is assumed that the intervention will be considered worthwhile and beneficial, and those planning the trial cannot (or do not want to) justify interrupting the intervention once it has been started. The SWD also has the advantage that the intervention only needs to be started in a few clusters at once, which from an organizational perspective is often a very important factor. For example, in the trial conducted in Gambia described above it was not possible, for organizational reasons, to begin HBV vaccination in all 60 000 children (50% of the study population) at the same time.

**Table 2**

Table 2 shows optimum weighting for the trial design shown in Table 1, as an example. The results hold under the following simplifying conditions (4, 6):

- Condition 1: Analysis is performed in 2 steps; the first one consists of calculating averages for each cluster and point in time. Subsequent analysis relates to these aggregate values alone, and for these the basic distributional assumptions are required to hold.
- Condition 2: Cluster means are normally distributed (at least approximately) with a variance that is independent of both time point and treatment.
- Condition 3: Cluster means are correlated between times at which parameters are measured. However, the magnitude of this correlation depends on neither temporal distance nor the type of treatment (control or intervention). In principle, correlations depend on whether and how often repeated measurements are taken from the same individual.
- Condition 4: As an average over the population of all clusters, the clusters’ arithmetical means are the sum of a period effect specific to the time at which parameters are measured and the time-independent effect (hereafter referred to as θ) of the treatment being investigated (the intervention).
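Conditions 2 to 4 can be written compactly as follows. The notation is an assumption made for this sketch and does not appear in the original boxes: $\bar{Y}_{ij}$ is the mean of cluster $i$ at time point $j$, $\mu_j$ is the period effect, and $X_{ij}$ equals 1 if cluster $i$ is under the intervention at time point $j$ and 0 otherwise. Means from different clusters are taken to be independent.

```latex
\begin{aligned}
\mathrm{E}\left[\bar{Y}_{ij}\right] &= \mu_j + \theta\, X_{ij}, \\
\mathrm{Var}\left[\bar{Y}_{ij}\right] &= \sigma^{2}, \\
\mathrm{Corr}\left[\bar{Y}_{ij}, \bar{Y}_{ij'}\right] &= \rho \quad (j \neq j').
\end{aligned}
```

The first line expresses Condition 4 (a period effect plus a time-independent treatment effect θ), the second Condition 2, and the third Condition 3.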

Using these conditions, the standard error (stderr) of the optimum estimator of the treatment effect can be calculated exactly. A relatively simple formula can be used (Box 1) to do so for arbitrary numbers of intervention start times (T) and clusters (n) that transition from the control phase to the intervention phase at the same time. This formula can be used to calculate a confidence interval for the estimated treatment effect obtained by analyzing an SWD trial. The entries in the table in Box 1 show how the width of this confidence interval, and therefore the statistical precision of the estimator, is affected by the parameters underlying the trial design.
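As a hedged numerical illustration, the standard error can also be obtained directly from Conditions 1 to 4 by generalized least squares, without transcribing the formula of Box 1. The sketch below assumes T + 1 measurement time points, n clusters per start time, and known σ² and ρ. It uses two facts: eliminating the period effects leaves the information Σ (x − x̄)ᵀ Σ⁻¹ (x − x̄), summed over all clusters, and the inverse of an equicorrelated covariance matrix has a closed form. The function name `swd_stderr` is hypothetical.

```python
import math


def swd_stderr(n_steps, clusters_per_step, sigma2, rho):
    """Standard error of the GLS estimator of theta under Conditions 1 to 4.

    n_steps: number of intervention start times T (T + 1 time points assumed).
    clusters_per_step: number n of clusters switching at each start time.
    sigma2, rho: variance of a cluster mean and its correlation between
    any two time points, both treated as known.
    """
    if n_steps < 2:
        raise ValueError("with fewer than 2 start times theta is confounded with time")
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch assumes 0 <= rho < 1")
    n_times = n_steps + 1
    # Treatment indicators per sequence: 0 before the start time, 1 afterwards.
    seqs = [[1.0 if t >= step else 0.0 for t in range(n_times)]
            for step in range(1, n_steps + 1)]
    # Proportion of clusters under intervention at each time point.
    mean_x = [sum(col) / n_steps for col in zip(*seqs)]
    # Closed-form inverse of sigma2 * ((1 - rho) * I + rho * J):
    # d' inv(Sigma) d = (d'd - shrink * (sum d)^2) / (sigma2 * (1 - rho)).
    shrink = rho / (1.0 + (n_times - 1) * rho)
    info = 0.0
    for x in seqs:
        d = [xi - mi for xi, mi in zip(x, mean_x)]
        quad = (sum(v * v for v in d) - shrink * sum(d) ** 2) / (sigma2 * (1.0 - rho))
        info += clusters_per_step * quad
    return math.sqrt(1.0 / info)


if __name__ == "__main__":
    # Hypothetical inputs: 4 start times, 3 clusters per start time.
    print(swd_stderr(n_steps=4, clusters_per_step=3, sigma2=1.0, rho=0.5))
```

In this sketch the standard error shrinks in proportion to 1/√n when the number of clusters per start time grows and T, σ², and ρ are held fixed.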

**Box 1**

## Significance testing, power, and sample size planning

Just as simple as calculating the limits of confidence intervals is statistical testing of the null hypothesis that the “true” treatment effect θ (i.e. the effect without superposition of chance deviations) is 0.
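A minimal sketch of this step, assuming that the estimator is normally distributed and that its standard error is known, so that a two-sided z-test applies. The function name `wald_inference` and the numbers in the usage example are hypothetical.

```python
from statistics import NormalDist


def wald_inference(theta_est, stderr, alpha=0.05):
    """Two-sided confidence interval and p-value for H0: theta = 0."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    lower = theta_est - z_crit * stderr
    upper = theta_est + z_crit * stderr
    # Standardized estimate compared with the standard normal distribution.
    z = theta_est / stderr
    p_value = 2.0 * (1.0 - nd.cdf(abs(z)))
    return (lower, upper), p_value


if __name__ == "__main__":
    ci, p = wald_inference(theta_est=0.3, stderr=0.1)
    print(ci, p)
```

The null hypothesis is rejected at level α exactly when the confidence interval excludes 0, which is why the two calculations are equally simple.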

When planning an SWD, it is important to realize that the procedure to be used for calculating power cannot be converted into a simple formula for the number n of clusters that start the intervention at the same time. As shown in the formula given in Box 1, the standard error of θ_{est}, as well as the power, depend not only on the variance (σ²) of the cluster means and the number of clusters (n), but also on the number of intervention start times (T) and the correlation between repeated measurements in a single cluster. The conclusions to be drawn from comparative investigations into the efficiency of various SWD trials, cluster-randomized parallel-group trials, and individually randomized trials therefore depend on the number of participating individuals, the number of times measurements are repeated per individual, the number of clusters starting intervention at the same time, and the number of possible starting times.
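One way to handle this in practice is a stepwise search: compute the power for n = 1, 2, ... until the target is reached. The sketch below assumes a two-sided z-test and takes the standard error as a user-supplied function of n; the names `power_two_sided` and `min_clusters_per_step` and the standard-error function in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def power_two_sided(theta, stderr, alpha=0.05):
    """Power of the two-sided z-test of H0: theta = 0 at a given true effect."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(theta) / stderr
    # Probability of rejection in either tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


def min_clusters_per_step(theta, stderr_for_n, target_power=0.90,
                          alpha=0.05, n_max=10_000):
    """Smallest n (clusters per start time) that reaches the target power.

    stderr_for_n: function mapping n to the standard error of theta_est
    for the design under consideration.
    """
    for n in range(1, n_max + 1):
        if power_two_sided(theta, stderr_for_n(n), alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    # Hypothetical design with standard error 0.4 for n = 1.
    n = min_clusters_per_step(0.25, lambda k: 0.4 / math.sqrt(k))
    print(n)
```

Because the standard error enters only through `stderr_for_n`, the same search can be repeated for different values of T and ρ to compare candidate designs.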

The eBox compares cluster-randomized SWD trials and parallel-group trials in various scenarios in which both the variance σ^{2} of the cluster means and their correlation ρ between time points are functions of the so-called intraclass correlation coefficients (ICCs) within clusters. If one measures the efficiency of a design in terms of the total number of clusters required to detect an effect of θ = 0.25 with a probability (power) of 90% in a test at the usual significance level α = 0.05 (2-tailed), the findings are as follows: in these settings, SWD trials are more efficient than parallel-group trials unless ICC values are very low (eFigure). However, it should be noted that there is a fundamental qualitative change in this picture if, unlike in the scenarios investigated in the eBox, the number of measurements to be performed at each point in time in individual clusters is the same for all designs. Then, parallel-group trials are substantially more efficient than SWD trials unless ρ is very large.
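To illustrate how σ² and ρ can depend on an ICC, the sketch below uses one common parametrization, which is an assumption and may differ from the scenarios in the eBox: a random cluster intercept, and m newly sampled individuals per cluster at each time point (cross-sectional sampling). The function name `cluster_mean_params` and the individual-level variance `sigma_ind2` are hypothetical.

```python
def cluster_mean_params(icc, m, sigma_ind2=1.0):
    """Variance of a cluster mean and its correlation between time points.

    Assumes a random cluster intercept and m new individuals per cluster
    at every time point, with total individual-level variance sigma_ind2.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    # Between-cluster share plus the within-cluster share averaged over m.
    sigma2 = sigma_ind2 * (icc + (1.0 - icc) / m)
    # Only the shared cluster intercept induces correlation over time.
    rho = icc / (icc + (1.0 - icc) / m)
    return sigma2, rho


if __name__ == "__main__":
    print(cluster_mean_params(icc=0.05, m=20))
```

Under this assumption even a small ICC yields a substantial correlation ρ between cluster means once m is moderately large.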

**eBox**

**eFigure**

## How to proceed when outcome parameter variance and time-dependent correlation are unknown

The facts and conclusions on statistical planning and analysis of SWD trials that are summarized here hold under the assumption that both the variance σ^{2} between clusters and the correlation coefficient ρ between measurements for one cluster at different times are known quantities. Whenever an SWD trial needs to be analyzed without this prior knowledge, a considerably more complicated statistical procedure must be used, one that estimates σ^{2} and ρ from the trial data in addition to the treatment effect θ, the parameter of primary interest.

This extended procedure was used to obtain the findings shown in Box 2 by analyzing the sample SWD trial described in Table 3. Full details of the procedure can be found in the documentation of software for analyzing mixed linear models, such as the SAS procedure PROC MIXED (9). Such complex statistical models should also be used to analyze trials in which correlations between repeated measurements are assumed to be due to intraindividual effects. Among other things, this typically implies that the variation between clusters can no longer be described by a single dispersion parameter, as Condition 2 requires. Even when σ^{2} and ρ have to be estimated as part of the analysis of an SWD trial, the trial is usually (4) planned as described above for settings in which σ^{2} and ρ are known.

**Box 2**

**Table 3**

## Discussion

Like true crossover trials, SWD trials yield longitudinal data, since the outcome variable is measured repeatedly in each observational unit (cluster). Another common feature of these two trial designs is that they entail a high risk of producing misleading results: if the very restrictive underlying assumption that there is no interaction between intervention effect and measurement time is incorrect, the treatment effect cannot be estimated without bias. It is very important to bear this caveat in mind when planning and interpreting trials.

Alternatively, an SWD trial can also be regarded as a sequence of T + 1 parallel-group trials, with a constant sample size (n) but a proportion of observational units allocated to the intervention arm that varies over time (increasing from 0 to 100%).

Even if the cluster means obtained in an SWD trial are normally distributed, approximations are usually needed to test hypotheses concerning the treatment effect. Various approaches are available for this purpose. They yield different results, and none can be said to be generally preferable to the others. Furthermore, as is often the case when analyzing longitudinal data, SWD trials are usually analyzed on the basis of assumptions about the correlation structure that greatly simplify the true situation (equicorrelation model).

The main practical incentive stated for conducting an SWD trial is usually a wish to give all patients access to the intervention being investigated, at least in the last period of the trial. This is particularly desirable if there is information available suggesting that the intervention is effective. This was the pivotal argument for the trial conducted in Gambia: those conducting the trial were sure that vaccination was essentially effective.

SWD trials are thus an alternative to conventional trials when there are practical limitations that preclude carrying out cluster-randomized parallel-group trials. Conducting such a trial would require the nursing staff training and other preparations associated with the trial intervention to be completed swiftly enough for the intervention to be started in all trial patients simultaneously.

If statistical analysis of the trial is performed correctly (and at an appropriate level of complexity), basic methodological requirements can be met. Although the conditions required for a valid statistical evaluation of the treatment effect can be specified clearly in theory, in practice they are difficult to test.

**eTable**

## Conflict of interest statement

The authors declare that no conflict of interest exists.

Manuscript received on 12 October 2018, revised version accepted on 15 April 2019.

Translated from the original German by Caroline Shimakawa-Devitt, M.A.

Corresponding author:

Prof. Dr. rer. nat. Maria Blettner

Institute for Medical Biostatistics, Epidemiology and Informatics

Johannes Gutenberg University Mainz

Obere Zahlbacher Str. 69

55131 Mainz, Germany

blettner@uni-mainz.de

Cite this as:

Wellek S, Donner-Banzhoff N, König J, Mildenberger P, Blettner M: Planning and analysis of trials using a stepped wedge design: part 26 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2019; 116: 453–8. DOI: 10.3238/arztebl.2019.0453

Prof. Dr. rer. nat. Stefan Wellek

Department of General Practice/Family Medicine, University of Marburg: Prof. Dr. med. Norbert Donner-Banzhoff, MHSc

Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), Faculty of Medicine, Johannes Gutenberg University of Mainz: Dr. sc. hum. Jochem König

Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), Faculty of Medicine, Johannes Gutenberg University of Mainz: Philipp Mildenberger, MSc

Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), Faculty of Medicine, Johannes Gutenberg University of Mainz: Prof. Dr. rer. nat. Maria Blettner


1. Beard E, Lewis JJ, Copas A, et al.: Stepped wedge randomised controlled trials: systematic review of studies published between 2010 and 2014. Trials 2015; 16: 353.

2. Cook TD, Campbell DT: Quasi-experimentation: design and analysis issues for field settings. Boston: Houghton Mifflin 1979.

3. Gambia Hepatitis Study Group: The Gambia Hepatitis Intervention Study. Cancer Res 1987; 47: 5782–7.

4. Hussey MA, Hughes JP: Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007; 28: 182–91.

5. Wellek S, Blettner M: On the proper use of the crossover design in clinical trials: part 18 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2012; 109: 276–81.

6. Hemming K, Lilford R, Girling AJ: Stepped-wedge cluster randomised controlled trials: a generic framework including parallel and multiple-level designs. Stat Med 2015; 34: 181–96.

7. Rhoda DA, Murray DM, Andridge RR, Pennell ML, Hade EM: Studies with staggered starts: multiple baseline designs and group-randomized trials. Am J Public Health 2011; 101: 2164–9.

8. Hughes JP, Granston TS, Heagerty PJ: Current issues in the design and analysis of stepped wedge trials. Contemp Clin Trials 2015; 45 (Pt. A): 55–60.

9. SAS: SAS/STAT(R) 14.1 User's guide. The MIXED procedure. https://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_mixed_details.htm (last accessed on 20 May 2019).

10. Hoogendijk EO, van der Horst HE, van de Ven PM, et al.: Effectiveness of a geriatric care model for frail older adults in primary care: Results from a stepped wedge cluster randomized trial. Eur J Intern Med 2016; 28: 43–51.

11. Coleman K, Austin BT, Brach C, Wagner EH: Evidence on the Chronic Care Model in the new millennium. Health Affairs 2009; 28: 75–85.

12. Brook RH, Ware JEJ, Davies-Avery A, et al.: Overview of adult health measures fielded in Rand’s health insurance study. Med Care 1979; 17: 1–131.