Sample size calculations are poorly conducted and reported in many randomized trials of hip and knee osteoarthritis: results of a systematic review.

OBJECTIVES
To review the methodology and reporting of sample size calculations in a contemporary sample of trials in osteoarthritis.


STUDY DESIGN AND SETTING
Randomized trials in hip and/or knee osteoarthritis published in 2016 were identified by searching MEDLINE, the Cochrane Library, CINAHL, EMBASE, PsycINFO, PEDro, and AMED until March 31, 2017. Data were extracted on study characteristics, methods used to calculate the sample size, and the reporting and justification of components used in the sample size calculation. We attempted to replicate the sample size calculation using the reported information.


RESULTS
This review included 116 trials. Seventy-eight (67%, n = 78/116) reported a power calculation. Less than a quarter reported all core components of the sample size calculation (21%, n = 16/78). The sample size calculation was only reproducible in 53% of the trials that reported a power calculation (n = 41/78). The replicated calculation produced a sample size over 10% larger than the reported value in 12% of trials (n = 9/78). Insufficient information was reported to allow the sample size calculation to be replicated in a quarter of trials (27%, n = 21/78).


CONCLUSION
Sample size calculations in trials of hip and knee osteoarthritis are not adequately reported, and the calculation frequently cannot be reproduced.


What is new?

Key findings
- Sample size calculations in hip and knee osteoarthritis trials are often poorly reported, and where reported, the calculation often cannot be replicated.
- The standard deviation assumed in the sample size calculation was a poor estimate of the observed standard deviation in a substantial proportion of trials.

What this adds to what was known?
- This is the first review of sample size calculations in trials of hip and knee osteoarthritis, and it shows that the problems of poor reporting and lack of reproducibility of sample size calculations also exist in this clinical area.

What is the implication and what should change now?
- Trialists and reviewers should ensure that sample size calculations are reported clearly and completely, to facilitate the interpretation of trial results and prevent the conduct of underpowered trials.
- Trialists should perform a sensitivity analysis at the design stage to explore how a difference in the estimate of the standard deviation could affect the power of the study.


Introduction
Sample size calculation is a key part of designing a clinical trial and is important for ethical, practical, and financial reasons. An overly large sample size can increase trial costs, delay dissemination of study findings, and result in more participants receiving a treatment when there is already sufficient evidence to show it is inferior to an alternative [1]. An overly small sample size can lead to underpowered trials that are more likely to "miss" a clinically important treatment effect, should it exist [2,3].
Altman et al. emphasized the importance of reporting the justification for the target sample size, especially when the trial does not recruit as many participants as planned [4]. When the sample size calculation is adequately reported, the reader can understand what the study was designed to achieve. The difference between the treatments that the trial was designed to statistically detect (the target difference), with associated assumptions, should be specified [5]. If well justified, the target difference can inform the interpretation of the trial findings, clarifying the presence (or absence) of a meaningful difference. Appropriate calculation of the sample size and reporting of the calculation help to avoid research waste, preventing the conduct of trials that are likely to produce inconclusive and potentially misleading results. Previous systematic reviews have found that power calculations are often not performed, inadequately reported, or based on inaccurate assumptions [5-7]. A study may be underpowered if the parameters used to calculate its sample size are based on inaccurate assumptions [8-10]. Reviews of trials in a handful of specific conditions, such as back pain and rheumatology, have highlighted poor reporting of sample size calculations [11-13]. Focusing on a specific clinical area reduces the heterogeneity in the assumptions made in the sample size calculation. For instance, oncology trials are more likely to be powered on survival, which is not usually applicable to low-mortality conditions such as osteoarthritis [14].
We explored whether recently published osteoarthritis trials also poorly reported their sample size calculations.
To our knowledge, the sample size calculations of hip and knee osteoarthritis trials have not previously been reviewed. Few reviews of any clinical area have attempted to replicate the sample size calculation of published trials [5,7,15,16]. Even fewer have compared the standard deviation assumed in the sample size calculation with the observed values in the trial results [7,9].

Objectives
The primary objective of this study was to summarize current practice in calculating the sample size for trials of hip and knee osteoarthritis, including the sample size, target difference, and justification for the chosen inputs.
Secondary objectives were to assess the reporting and reproducibility of these sample size calculations.

Materials and methods
The study methods were described in a published protocol and are summarized below [17].

Identification of studies
Seven databases were searched to identify relevant articles published in 2016: MEDLINE, the Cochrane Library (CENTRAL), CINAHL, EMBASE, PsycINFO, PEDro, and AMED (MEDLINE search strategy in Appendix B). The final search was performed on March 31, 2017, to allow for a 3-month lag between publication and database indexing.

Selection of studies
Abstracts and full texts were each screened independently by two of four reviewers (B.C., U.A., K.V., and J.Y.T.). Disagreements were resolved by discussion with a third reviewer (J.A.C.).

Inclusion criteria
Studies were eligible for inclusion if the article was the primary report of a randomized controlled trial of two treatment arms in a hip and/or knee osteoarthritis population. Included articles were published online or in a journal issue in 2016.

Exclusion criteria
The following article and study types were excluded:
- Conference abstracts
- Study protocols
- Non-English language articles
- Quasirandomized and nonrandomized studies
- Pilot and feasibility studies
- Factorial designs
- Cross-over trials
- Trials with three or more arms
- Studies that did not evaluate treatments (e.g., comparing different methods of providing information to improve patient knowledge)
- Studies examining osteoarthritis prevention
- Studies combining osteoarthritis and nonosteoarthritis populations (e.g., participants with osteoarthritis or rheumatoid arthritis, or trials of total knee arthroplasty where it was not explicitly stated that all participants had osteoarthritis)
- Secondary analyses of trials (e.g., long-term follow-up or subgroup analyses)

Data extraction
Data extraction on study characteristics included the study design, population, eligibility criteria, intervention and comparison treatments, and primary outcome. Data extraction on the sample size calculation included the target sample size, calculation method, values used, and justification (e.g., effect size, target difference, standard deviation, loss to follow-up, use of a one-tailed or two-tailed test, significance level, and power). Data extraction on the study results included the number of participants randomized, the number lost to follow-up, and the standard deviation of the primary outcome.
A second reviewer independently extracted the data from a sample of 20% of the included studies. Additional details of the sample size calculation were extracted from the study protocol if cited in the main article.

Sample size replication
Core values for the sample size calculation were defined as the power, significance level, whether a one-tailed or two-tailed test was used, the level of attrition, and:
- for continuous outcomes in superiority trials: the target difference, as a standardized effect size or as a mean difference and standard deviation;
- for continuous outcomes in noninferiority trials: the noninferiority margin, mean difference, and standard deviation;
- for binary outcomes in superiority trials: any two of the anticipated between-group risk difference, the effect in the intervention group, and the effect in the control group.
We attempted to replicate the sample size calculations using the reported values. Unless otherwise stated, we assumed that 80% power and 5% two-tailed significance level with a superiority hypothesis were used, anticipating no attrition. For noninferiority trials, where not reported, we assumed the anticipated mean difference was 0.
To compare the replicated and reported target sample sizes, we calculated the ratio of the replicated value to the reported value. We present the number of studies with a replicated value over 10% or 30% above or below the reported sample size (ratio above 1.1 or 1.3, or below 0.9 or 0.7). The calculations were considered reproducible if the replicated value was within 10% of the reported value, to account for potential differences in software and rounding errors.
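As an illustration of the replication exercise described above, the following is a minimal Python sketch of the standard normal-approximation sample size formula for a two-arm superiority trial with a continuous outcome, together with the replicated-to-reported ratio. The formula and the numbers in the usage example are our own illustrative assumptions; the review does not specify the software used by the included trials.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, power=0.80, alpha=0.05, two_sided=True):
    """Approximate per-arm sample size for comparing two means
    (normal approximation to the two-sample t-test)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    effect = delta / sd  # standardized effect size
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect ** 2)

def replication_ratio(replicated_total, reported_total):
    """Ratio of replicated to reported target sample size; values
    outside 0.9-1.1 were treated as non-reproducible in this review."""
    return replicated_total / reported_total

# Hypothetical trial: 10-point target difference, assumed SD 15,
# 80% power, 5% two-sided significance, inflated for 20% attrition.
n = n_per_arm(delta=10, sd=15)       # 36 per arm
total = ceil(2 * n / (1 - 0.20))     # 90 participants after attrition
```

With the default assumptions of 80% power and a 5% two-tailed significance level, this reproduces the kind of calculation attempted for each included trial.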

Data synthesis
For categorical and binary outcomes, data were summarized using the number and proportion of studies. For continuous outcomes, data were summarized using the median and interquartile range (IQR).
For continuous outcomes, the standard deviation assumed in the sample size calculation was compared with the corresponding value in the study results at the final follow-up time point. Again, we present the number of studies with a ratio above 1.1 and 1.3 or below 0.9 and 0.7.

Subgroup analyses
Subgroup analyses were performed to assess whether trial characteristics were associated with the number of participants randomized, whether the sample size calculation was reported and fully specified, and the reproducibility of the sample size calculation (reported value within 10% of the replicated value). Subgroup analyses examined differences by intervention (surgical or nonsurgical), number of trial centers (single center or multicenter), funding source (full/partial industry funding or no industry funding), and comparator treatment (placebo/waitlist or active control).
For continuous outcomes, subgroups were compared using the median difference and 95% confidence interval, estimated using the Hodges-Lehmann estimator [18,19]. For binary outcomes, subgroups were compared using absolute risk differences with 95% confidence intervals. A significance level of 0.05 was used. As the subgroup analyses were exploratory, no adjustments were made for multiple testing.
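The Hodges-Lehmann estimate of the shift between two subgroups is the median of all pairwise differences between their observations. A minimal numpy sketch, for illustration only (the review's own analysis code is not reproduced here):

```python
import numpy as np

def hodges_lehmann_diff(x, y):
    """Hodges-Lehmann estimate of the location shift between two
    samples: the median of all pairwise differences x_i - y_j.
    Note: forms the full n*m difference matrix, so O(n*m) memory."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.median(np.subtract.outer(x, y)))
```

A confidence interval for this estimate can be obtained from the ordered pairwise differences, as described in the cited references.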
Results

Of the 116 included trials, 78 (67%) reported a power calculation. Trials reporting a power calculation were more likely to have a larger sample size, cite the trial protocol, and report the trial funding source (Table 1). They were otherwise generally similar to trials not reporting a power calculation.
Among the 38 trials that did not report a power calculation, one reported a post hoc power calculation and six reported that a power calculation was conducted but did not provide details. Of the remaining 31 trials, four reported that the sample size was based on the predefined recruitment period and 27 did not justify the number of participants. In the 31 trials that did not report conducting a power calculation, 65% (20/31) mentioned the small sample size or lack of power calculation as a limitation of the trial and another 10% (3/31) stated that future trials with larger sample sizes were necessary.
The results that follow are based on the 78 trials that reported details of a power calculation.

Sample size calculation methodology
All of the included trials used a conventional (Neyman-Pearson, or statistical hypothesis testing) power calculation approach [23,24]. None used alternative techniques, such as Bayesian approaches or simulations [25-27]. Two trials (3%, n = 2/78) reported a sample size calculation that was inappropriate for the study design; one used a sample size calculation for a paired sample despite using an unpaired design, and the other used a survey-based approach [28,29].
Most of the trials had a continuous primary outcome (97%, n = 76/78) (Table 2). One trial used a binary primary outcome, and none used a time-to-event primary outcome. The type of primary outcome in the remaining trial was unclear, as the target difference was reported as a percentage. The trials were usually powered on one primary outcome (91%, n = 71/78). One trial planned to re-estimate the sample size if attrition was higher than expected, but this was found to be unnecessary. Three trials conducted unplanned sample size re-estimations (4%, n = 3/78) due to poor recruitment, low attrition, or post hoc analysis. Nine trials conducted sensitivity analyses on their sample size calculations (12%, n = 9/78), usually to assess the power for secondary outcomes (n = 4).

Reporting sample size calculations
Only 21% (n = 16/78) of the studies reported all of the core components of their sample size calculation. Almost all of the trials reported the power and significance level (96%, n = 75/78). However, other components were not well reported, including the level of attrition (73%, n = 57/78) and whether a one-tailed or two-tailed test was used (41%, n = 32/78).
Almost all superiority trials powered on a continuous outcome reported the mean difference (90%, n = 61/68), and most reported the standard deviation (66%, n = 45/68). Most trials reported the standardized effect size or enough information to calculate it (79%, n = 54/68).
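Where a trial reports the target mean difference and the assumed standard deviation but not the standardized effect size, the latter can be recovered directly. A trivial sketch with hypothetical numbers:

```python
def standardized_effect_size(mean_diff, sd):
    """Standardized effect size (Cohen's d scale) implied by a target
    mean difference and an assumed common standard deviation."""
    return mean_diff / sd

# e.g. a hypothetical trial targeting a 10-point difference with an
# assumed SD of 15 is implicitly powered to detect d of about 0.67
d = standardized_effect_size(10, 15)
```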
The single included cluster-randomized trial reported the intraclass correlation coefficient assumed in the power calculation to adjust for clustering. The single included trial with a binary outcome reported the risk difference between the two arms but not the anticipated effect in the intervention or control group.
Thirteen trials referred to a study protocol. Most of them (77%, n = 10/13) reported the sample size calculation consistently between the study protocol and the results publication. The three trials with discrepancies reported (i) different target sample sizes, (ii) different levels of attrition, and (iii) only a revised sample size calculation in the results publication.
Where reported, the standardized effect size that the trial was powered to detect was moderate to large for most trials (median 0.75, IQR 0.50 to 0.86). Only one trial reported a standardized effect size of 0.2 or less for the target difference. However, it is likely that this target difference was incorrectly reported, as the reproduced sample size for this trial was much greater than the reported sample size [30]. Table 3 shows the proportion of trials with continuous outcomes that justified the mean difference and standard deviation. The mean difference was most commonly based on a treatment difference from a published trial (29%, n = 22/76) or a published minimum clinically important difference (17%, n = 13/76). Where reported, justifications for the standard deviation were almost exclusively based on previously published trials (33%, n = 25/76). Very few trials justified the anticipated level of attrition (3%, n = 2/76).

Reproducibility
Only half of the reported sample size calculations were reproducible (53%, n = 41/78) (Table 4). The replicated calculations produced a sample size over 10% larger than the reported value in 12% of trials (n = 9/78) (Fig. 2). More than a quarter of the trials did not report enough information for us to replicate the sample size calculation (28%, n = 22/78). The sample size could be replicated in most of the trials that reported all of the core components, so that no assumptions were needed (88%, n = 14/16).
The absolute difference between the reproduced and reported sample size was small for most studies (median difference 1 participant, IQR 0 to 5). Five studies showed a difference between the reproduced and reported sample size of more than 50 participants (9%, n = 5/56). Four of these five trials underestimated the sample size (replicated value over 30% larger than the reported value), and one overestimated it (replicated value at least 30% smaller than the reported value).

Accuracy of components
Comparing the standard deviation assumed in the sample size calculation with the value observed at follow-up, the assumed standard deviation was accurate (within 10%) in only one-third of trials (31%, n = 9/29) (Table 5). The follow-up standard deviation was over 30% larger than the value assumed in the sample size calculation in six trials (21%, n = 6/29), leading to a reduction in power.
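The loss of power from an underestimated standard deviation can be quantified with the normal approximation. A sketch with hypothetical design values (not taken from any included trial):

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(n_per_arm, delta, true_sd, alpha=0.05):
    """Approximate power of a two-sample comparison of means when the
    true SD differs from the value assumed at the design stage."""
    z = NormalDist().inv_cdf
    # standardized mean difference under the alternative hypothesis
    ncp = delta / (true_sd * sqrt(2 / n_per_arm))
    return NormalDist().cdf(ncp - z(1 - alpha / 2))

# A trial sized at 36 per arm to detect a 10-point difference with
# an assumed SD of 15 has roughly 81% power; if the observed SD is
# 30% larger (19.5), power falls to roughly 59%.
```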

Subgroup analysis
Exploratory subgroup analysis did not detect any significant differences in reporting based on funding source, study intervention, or comparator type (Appendix C). Multicenter trials recruited significantly larger sample sizes.

Discussion

Summary of findings
This systematic review summarizes current practice in the methodology, reporting, and replicability of sample size calculations in randomized trials of hip and knee osteoarthritis. Two-thirds of the trials reported a sample size calculation. Most of the remaining one-third made no reference to their choice of sample size. Almost all sample size justifications were based on a conventional power calculation approach. The sample size calculation was fully described in very few studies. The sample size calculation could often not be replicated.
The studies most commonly omitted the anticipated attrition or standard deviation. Where reported, the justification for the target difference was based on findings of previous trials and/or a published estimate of the minimum clinically important difference. The standard deviation was commonly based on the results of previously published trials. However, for many trials, the standard deviation assumed in the power calculation was inaccurate (either too small or too large) when compared to the follow-up results of the trial. Underestimating the standard deviation can lead to underpowered trials.

[Table 4 footnote a: This included four trials where additional assumptions were required on the interpretation of the reported information to replicate the sample size; for two trials, the reported target difference had to be translated into a different scale, and for two trials, the reported value and the replicated value before accounting for attrition were compared because the anticipated attrition rate was unusually high.]

[Fig. 2 caption: Comparison of reproduced and reported sample sizes (as a percentage of the reported value). The red markers represent trials excluded from the figure where the difference was over 50% (five trials) or below -50% (four trials).]

[Table 5: Accuracy of the standard deviation (n = 29).]

Comparison with related literature
This review shows that poor reporting is a problem specifically in osteoarthritis trials, despite it being a well-developed area of clinical research. Overall, the reporting of sample size calculations found in this study was similar to that found in other clinical and methodological areas, with previous reviews finding that 50%-70% of trials reported a sample size calculation and around 25% of sample size calculations included all core components [12,13,32-34]. Although one review across multiple clinical areas found a higher proportion of studies reporting a power calculation (95%) and reporting all core components, this may be because it only considered publications in journals with a high impact factor [7].
The level of reporting of specific components of the sample size calculation (e.g., power, significance level, target difference) was consistent with other reviews [7,15,16]. Although our review agreed with Rutterford et al. that the assumed standard deviation, level of attrition, and justification for the target treatment difference were poorly reported, Rutterford et al. found much lower levels of reporting for these components [33]. This could be due to differences in reporting practices over time or between clinical areas. The target difference was most commonly justified using previous trials, which aligns with the results of a survey suggesting high awareness and endorsement of this method among trialists [35]. However, the survey also reported that trialists commonly used pilot studies to justify the target difference, which was rare in the osteoarthritis trials in our review [35].
There are mixed findings in the literature on the reproducibility of sample size calculations. Some reviews have found a similar level of discrepancies between the replicated and reported sample sizes to our review [5,16]. Reviews that found a higher quality of reporting also found a much higher proportion of replicable calculations [7,15].

Strengths and limitations
The key strengths of this review are the systematic search strategy and restricted eligibility criteria. Restricting eligibility to trials of hip and knee osteoarthritis produced a more homogeneous sample in terms of the outcome measures used and population from which the trials recruited, compared to reviews considering trials in any clinical area. By including a contemporary sample of trials, this review should provide insight into the clinical trial methodology used in current practice.
The main limitation of this review is that, while the overall sample was substantial, some of the subgroups were small. The findings of these subgroup analyses should be interpreted cautiously. As the review was restricted to trials published in 2016, we cannot draw conclusions about changes in reporting and methodology over time. The sample of included articles may also be less representative of lower-impact journals, as they usually take longer to be indexed in databases [36].
Our assessment of sample size calculations relied on published information and thus was hindered by poor reporting. For example, a trial's sample size may have been calculated using an appropriate power calculation without this being reported in the results paper. It is also possible that a sample size calculation may have been modified after trial design but before publication [37]. Therefore, the a priori sample size calculation conducted during the study design stage may not have been described accurately in the reported results paper. A review of trial protocols or ethics applications may more accurately reflect sample size calculations done during the design phase [5]. The results of this review may not be applicable to other clinical areas, particularly where dichotomous or time-to-event primary outcomes are common.

Implications
Although there are examples of good practice in the literature, it is concerning that one-third of trials of hip and knee osteoarthritis made no reference to the choice of sample size. When a power calculation was reported, there was often insufficient information to reproduce the calculation. Sample size calculations thus often cannot be verified, making it difficult for readers to interpret the trial results in view of the assumptions made when the study was designed. There is potential for improvement in the reporting of the predicted level of attrition, standard deviation, use of a one-tailed or two-tailed test, and justification for values used in the calculation.
For some trials, the value produced by attempting to replicate the sample size calculation was very different to the sample size reported in the trial publication. The reported information may have been misleading or inaccurate. Alternatively, there may have been a fundamental error in the original calculation, as was the case for at least a few trials where the calculation was clearly inappropriate for the trial design. Our results highlight inaccuracies in the standard deviation assumed in sample size calculations. Underestimating the standard deviation can lead to underpowered trials. Trialists should prespecify the target difference in terms of the between-group difference in the original scale of the primary outcome measure and the corresponding standard deviation, rather than specifying only the standardized effect size [38]. Trialists should perform sensitivity analyses to explore how changes in the assumed standard deviation will affect study power.
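The design-stage sensitivity analysis recommended above can be as simple as recomputing the required sample size over a plausible range of standard deviations. A sketch under the normal approximation, with hypothetical design values:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, power=0.80, alpha=0.05):
    """Per-arm n for a two-sided superiority comparison of means
    (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

# How the required per-arm sample size grows if the true SD exceeds
# the assumed value of 15 (10-point target difference):
for sd in (15.0, 16.5, 19.5):  # assumed, +10%, +30%
    print(sd, n_per_arm(delta=10, sd=sd))  # 36, 43, 60 per arm
```

Inspecting such a table at the design stage makes explicit how sensitive the planned power is to the standard deviation assumption.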
The poor reporting and lack of reproducibility of sample size calculations found in this review contribute to research waste [39,40]. Conducting a power calculation when designing a study can prevent underpowered trials from being carried out if they are likely to be uninformative. Clear and complete reporting of a power calculation allows the reader to see the primary outcome and the treatment effect believed to be clinically meaningful [41,42]. This helps the reader to interpret the trial results in terms of the likelihood of a false result and the clinical relevance of the findings. Although reporting guidelines have attempted to improve reporting, there is a clear need for statistically trained peer reviewers to ensure adequate reporting of sample size calculations in trial protocols, funding applications, ethical approval, and trial results publications [40]. Trial teams should be encouraged to involve members with formal training in statistics and research design early on in their trials [40].

Future research
Future research could explore the reasons for the lack of reproducibility of sample size calculations, for example, by contacting trial teams for additional information where the methods of calculation are unclear or deemed inaccurate. Future work could also explore whether other factors are associated with high-quality reporting of the sample size calculation, such as statistical peer review [43-45]. Future studies could examine the values of components used in sample size calculations in more detail, such as assessing the clinical relevance of the target differences or developing methods to more accurately predict the standard deviation used in the power calculation.

Conclusion
Sample size calculations in trials of hip and knee osteoarthritis are not consistently reported adequately. Even when reported in sufficient detail, the calculation cannot always be accurately reproduced. This raises concerns about whether the sample size calculation was performed correctly and whether the trial was appropriately designed to achieve its primary objective. It also makes it difficult to establish how likely it is that a meaningful difference between the treatments exists. Clear and accurate reporting of a sample size calculation (or justification) should be mandatory, with endorsement by journal editors and peer reviewers for grant applications, trial protocols, and results publications.