KEY CONCEPTS SERIES | Volume 134, P174-177, June 01, 2021

# Effect Modifiers and Statistical Tests for Interaction in Randomized Trials

Open Access

## Abstract

Statistical analyses of randomized controlled trials (RCTs) yield a causally valid estimate of the overall treatment effect, which is the contrast between the outcomes in the two randomized treatment groups, commonly accompanied by a confidence interval. In addition, the trial investigators may want to examine whether the observed treatment effect varies across patient subgroups (also called ‘heterogeneity of treatment effects’), i.e. whether the treatment effect is modified by the value of a variable assessed at baseline. The statistical approach for this evaluation of potential effect modifiers is a test for statistical interaction, which evaluates whether the treatment effect varies across levels of the effect modifier. In this article, we provide a concise and nontechnical explanation of the use of simple statistical tests for interaction to identify effect modifiers in RCTs. We explain how to calculate the test of interaction by hand, applied for illustration to a dataset with simulated data on 1,000 imaginary participants.

## 1. Background

Randomized controlled trials (RCTs) are considered the gold standard when evaluating a treatment's effectiveness because of their high internal validity when appropriately conducted. The goal of randomization is to balance both observed and unobserved participant characteristics between two (or more) randomly allocated treatment groups. Thus, the RCT design allows causal effects of treatments to be estimated because confounding will generally not be an issue [
• Little R.J.
• Rubin D.B.
Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches.
]. Usually, statistical analyses of RCTs yield an estimate of the overall treatment effect (say, $E_{overall}$), which is the contrast between the outcomes in the two treatment groups, commonly accompanied by a confidence interval.
RCTs can also have good external validity if they are based on real-life populations that are relevant for the intervention, treat the control group with an acceptable standard of care, and report outcomes that are meaningful. An ideal trial in this regard enrolls patients with a broad range of background characteristics, for example, disease severity, age, sex, race, and prior therapies. Following the primary analyses estimating the overall treatment effect, $E_{overall}$, the trial investigators may want to examine whether the observed treatment effect varies across patient subgroups (also called ‘heterogeneity of treatment effects’). In such cases we are interested in examining whether the treatment effect is modified by the value of another variable (i.e. the effect modifier) [
• Kent D.M.
• et al.
The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement.
]. The statistical approach for evaluating potential effect modifiers is a test for statistical interaction [
• Altman D.G.
• Bland J.M.
Interaction revisited: the difference between two estimates.
].
Findings from investigating heterogeneity of treatment effects in an RCT are important for understanding, interpreting, and translating its results, and consequently for determining whether there is an appropriate patient sub-population for treatment use. Evidence for effect modification therefore helps to delineate the applicability of an intervention, showing in whom the treatment is most likely to work, and is thus indicative of an RCT's external validity. In this article, we provide a concise and nontechnical explanation of simple statistical tests for interaction to identify effect modifiers in RCTs.

## 2. Definition

The statistical tests for interaction are often referred to as subgroup analyses, meaning any comparison of the treatment effect (net benefit) across subsets (i.e. subgroups) of patients with specific characteristics that could be potentially relevant effect modifiers. Usually subgroup analyses investigate subgroups defined by a factor measured either before or at baseline, such as sex (males vs. females). Subgroup analyses can be misleading if they are based on data-driven hypotheses, employ inappropriate statistical methods, or fail to account for multiple testing [
• Schandelmaier S.
• Briel M.
• Schmid C.H.
• Devasenapathy N.
• Hayward R.A.
• et al.
Development of the Instrument to assess the Credibility of Effect Modification Analyses (ICEMAN) in randomized controlled trials and meta-analyses.
]. As exemplified by Alosh et al. [
• Alosh M.
• Huque M.F.
• Bretz F.
• D'Agostino Sr, R.B.
Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials.
], one should distinguish between three categories of subgroup analysis: (i) exploratory analyses search for differential responses from early clinical trial data or from clinical trials that failed to establish treatment efficacy in their intended populations; (ii) supportive analyses aim at investigating the consistency of treatment effect across subgroups for a clinical trial that has established treatment efficacy in its intended overall population; and finally (iii) inferential analyses aim at establishing treatment efficacy in a pre-defined targeted subgroup and/or in the overall population.
The subgroups of interest are defined, preferably a priori, and the baseline variable under consideration needs to precede treatment in time. In the simplest case, our baseline factor is a covariate with only two levels (e.g. male vs. female subjects), leading to two subgroups (e.g. subgroup 1: males, subgroup 2: females). If we want to compare the treatment effects observed in the two subgroups, a first step is to estimate the treatment effects (i.e. net benefit) within each subgroup in separate analyses ($E_1$ and $E_2$, respectively). Next, a test for statistical interaction comparing the two subgroups can be calculated by hand based on the subgroup treatment effects ($E_1$ and $E_2$) and their corresponding standard errors ($SE_{E_1}$ and $SE_{E_2}$) [
• Altman D.G.
• Bland J.M.
Interaction revisited: the difference between two estimates.
]:
• Difference between subgroup effects, $d = E_1 - E_2$
• Standard error for d, $SE_d = \sqrt{SE_{E_1}^2 + SE_{E_2}^2}$
• Test statistic for the z-test, $z = \frac{d}{SE_d}$
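The three steps above are easy to script. The following Python sketch wraps them in a small function (the function name is ours; the two-sided p-value is obtained from the standard normal distribution via the error function):

```python
import math

def interaction_test(e1, se1, e2, se2):
    """Altman-Bland test for interaction between two subgroup estimates.

    e1, e2   : subgroup treatment effects (log-transformed for ratio measures)
    se1, se2 : their corresponding standard errors (on the same scale)
    Returns (d, se_d, z, p) for the two-sided z-test of H0: d = 0.
    """
    d = e1 - e2                            # difference between subgroup effects
    se_d = math.sqrt(se1**2 + se2**2)      # standard error of the difference
    z = d / se_d                           # z statistic
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return d, se_d, z, p
```

For example, `interaction_test(8.0, 0.89, 2.0, 0.91)` reproduces the hand calculation presented in the Application section below.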
The p-value is found by comparing the absolute (non-negative) value of z to the standard normal distribution; this tests the null hypothesis that, in the population, the difference between subgroups (d) is zero. For effect measures on a multiplicative scale (such as risk ratio, hazard ratio, or odds ratio), as opposed to the additive scale (such as risk differences), the analyses should be performed using the log-transformation and the corresponding standard errors [
• Altman D.G.
• Bland J.M.
Interaction revisited: the difference between two estimates.
]. Importantly, effect modification may be present on one scale but not on another, and conflicting opinions exist on which scale to use [
• Doi S.A.
• Furuya-Kanamori L.
• Xu C.
• Lin L.
• Chivese T.
• Thalib L.
Questionable utility of the relative risk in clinical research: a call for change to practice.
]. The European Medicines Agency (EMA) recommends using the scale on which the endpoint is commonly analyzed, and to present supplementary analyses on the complementary scale where inconsistency is observed [
European Medicines Agency (EMA)
Guideline on the investigation of subgroups in confirmatory clinical trials.
].
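For a ratio measure, the same calculation is applied on the log scale and then back-transformed. A minimal sketch, using hypothetical subgroup risk ratios and standard errors of log(RR) that we invented for illustration:

```python
import math

# Hypothetical example: subgroup risk ratios RR1 and RR2, with the
# standard errors of log(RR) (e.g. as reported by a regression model).
rr1, se_log_rr1 = 0.60, 0.15
rr2, se_log_rr2 = 0.90, 0.18

# Work on the log (additive) scale, per Altman and Bland
d = math.log(rr1) - math.log(rr2)
se_d = math.sqrt(se_log_rr1**2 + se_log_rr2**2)
z = d / se_d

# Back-transform: ratio of risk ratios with its 95% confidence interval
ratio = math.exp(d)
ci = (math.exp(d - 1.96 * se_d), math.exp(d + 1.96 * se_d))
```

Here the back-transformed interaction effect is a ratio of risk ratios; a confidence interval that includes 1 indicates no clear evidence of effect modification on the multiplicative scale.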

## 3. Application

For presenting the results of subgroup analyses graphically, forest plots are useful. Preferably, the plots should include a bold vertical line at the overall treatment effect (i.e., $E_{overall}$) rather than at the null (i.e., ‘no effect’) to guide correct interpretation regarding heterogeneity of treatment effects across subgroups. Fig. 1 illustrates an example based on a simulated dataset on 1,000 imaginary participants (randomized 1:1); the data were generated to reveal a standardized mean difference corresponding to a statistically significant moderate overall treatment effect of $E_{overall}$ = 5.00 (95%CI: 3.73 to 6.27) units. To this dataset we deliberately generated a contextual factor (CF 1) that would create two separate subgroups with different magnitudes of treatment effects ($E_1$: 8.00 and $E_2$: 2.00 units, respectively). The standard errors can be calculated from the confidence intervals shown in the figure, $SE_{E_1} = (9.75 - 8.00)/1.96 = 0.89$ and $SE_{E_2} = (3.78 - 2.00)/1.96 = 0.91$, respectively. From these values we can test the interaction and estimate the difference between the subgroups (with confidence interval). The test of interaction:
$d = 8.00 - 2.00 = 6.00$

$SE_d = \sqrt{0.89^2 + 0.91^2} = 1.273$

$z = \frac{6.00}{1.273} = 4.71$

A z-value of 4.71 gives p<0.001 when we refer it to a table of the normal distribution. The estimated interaction effect is d = 6.00 units; the corresponding 95% confidence interval is 6.00 ± 1.96*1.273 (i.e., 95%CI 3.50 to 8.50). The data thus provide evidence for effect modification, indicating that the treatment effect is significantly stronger in CF 1-positive than in CF 1-negative trial participants. The other contextual factors shown in Fig. 1 were computer-generated completely at random, and thus any apparent effect modification across CF 2, CF 3, …, and CF 7 reflects purely chance findings (a well-known caveat of multiple testing without an a priori hypothesis).
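The hand calculation above can be verified in a few lines of Python (the variable names are ours; the standard errors are recovered from the subgroup confidence intervals exactly as described in the text):

```python
import math

# Standard errors recovered from the subgroup 95% confidence intervals (Fig. 1)
se1 = (9.75 - 8.00) / 1.96   # subgroup 1 (CF 1-positive), E1 = 8.00
se2 = (3.78 - 2.00) / 1.96   # subgroup 2 (CF 1-negative), E2 = 2.00

d = 8.00 - 2.00                      # difference between subgroup effects
se_d = math.sqrt(se1**2 + se2**2)    # standard error of the difference
z = d / se_d                         # z statistic for the interaction test

# 95% confidence interval for the interaction effect d
lo, hi = d - 1.96 * se_d, d + 1.96 * se_d
```

Carrying full precision (rather than rounding the standard errors to 0.89 and 0.91 first) gives the same z of 4.71 and the same interval of 3.50 to 8.50 after rounding.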

## 4. Pointers

Altman and Bland originally presented this simple approach as an “Interaction revisited” Statistics Note in the BMJ in 2003 [
• Altman D.G.
• Bland J.M.
Interaction revisited: the difference between two estimates.
]. This approach is transparent and feasible when we want to compare two estimated quantities, such as means (Fig. 1) or proportions (Fig. 2), each with its standard error.
Although highly feasible, investigating subgroup effects should be done with great care and interpreted cautiously. Most trials are not powered to detect subgroup differences, but reporting the results nonetheless allows future meta-analyses to investigate effect modification across several trials, thereby achieving sufficient power. Currently, there exists no explicit, standard list of factors to be investigated for effect modification in trials. However, one may take initial inspiration from the U.S. Food and Drug Administration (FDA), which requires effectiveness data to be analyzed by sex, age, and racial subgroups.

## References

• Little R.J.
• Rubin D.B.
Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches.
Annu Rev Public Health. 2000; 21: 121-145
• Kent D.M.
• et al.
The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement.
Ann Intern Med. 2020; 172: 35-45
• Altman D.G.
• Bland J.M.
Interaction revisited: the difference between two estimates.
BMJ. 2003; 326: 219
• Schandelmaier S.
• Briel M.
• Schmid C.H.
• Devasenapathy N.
• Hayward R.A.
• et al.
Development of the Instrument to assess the Credibility of Effect Modification Analyses (ICEMAN) in randomized controlled trials and meta-analyses.
CMAJ. 2020; 192: E901-E906
• Alosh M.
• Huque M.F.
• Bretz F.
• D'Agostino Sr, R.B.
Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials.
Stat Med. 2017; 36: 1334-1360
• Doi S.A.
• Furuya-Kanamori L.
• Xu C.
• Lin L.
• Chivese T.
• Thalib L.
Questionable utility of the relative risk in clinical research: a call for change to practice.
J Clin Epidemiol. 2020;
• European Medicines Agency (EMA)
Guideline on the investigation of subgroups in confirmatory clinical trials.
European Medicines Agency (EMA), London, United Kingdom2019