Research Department of Oncology, Cancer Institute, Faculty of Medical Sciences, School of Life & Medical Sciences, University College London, London, UK
Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
• Sample size justification was lacking across 211 CPMs related to prostate cancer.
• An EPP of 10 is not necessarily sufficient to guide sample size for CPM development.
• Sample sizes for CPM development should follow formal sample size criteria.
• We recommend sample size justification in all future prediction model studies.
Abstract
Objective
Developing clinical prediction models (CPMs) on data of sufficient sample size is critical to help minimize overfitting. Using prostate cancer as a clinical exemplar, we aimed to investigate the extent to which existing CPMs adhere to recent formal sample size criteria or to the historic rule of thumb of 10 events per predictor parameter (EPP10).
Study Design and Setting
We conducted a systematic review to identify CPMs related to prostate cancer that reported enough information to calculate the minimum required sample size. We compared the reported sample size of each CPM against the traditional EPP10 rule of thumb and against formal sample size criteria.
Results
In total, 211 CPMs were included. Three of the included studies justified the sample size used, mostly using EPP rules of thumb. Overall, 69% of the CPMs were derived on sample sizes that surpassed the traditional EPP10 rule of thumb, but only 48% surpassed recent formal sample size criteria. For most CPMs, the sample size required by the formal criteria was higher than that needed to surpass 10 EPP.
Conclusion
Few of the studies included in this review justified their sample size, with most justifications based on EPP. This study shows that, in real-world data sets, adhering to the classic EPP rules of thumb does not ensure adherence to recent formal sample size criteria.
What is new?
Key findings
• This systematic review identified over 200 published prediction models related to prostate cancer; very few of the included studies justified their choice of sample size, and any justifications were usually based on events per predictor parameter (EPP).
What this adds to what was known?
• The classic use of the EPP10 rule of thumb is not necessarily enough to guide sample size for prediction model development based on formal criteria.
• Our study highlights the extent to which this situation has previously been an issue, and so serves as a benchmark for comparison in future reviews of studies in the coming years.
What is the implication and what should be done now?
• There is large scope to improve the justification of sample sizes used in prediction model studies, and single threshold values for EPP are insufficient to do this accurately.
• Regardless of whether a sample size calculation has been used, our recommendation is that a justification for sample size consideration should be included in all prediction model studies going forward.
1. Introduction
Clinical prediction models (CPMs) are statistical models or algorithms that can estimate the risk of existing disease (diagnostic) or the probability of future outcomes (prognostic) for an individual [
]. Estimates of risk are conditional on the values of multiple predictors that are observable at the time one wishes to make a prediction from the model. Classically, these models are based on multivariable modeling methods, such as logistic regression for binary outcomes or survival models for time-to-event outcomes.
Although there is a plethora of CPMs developed across medical domains, very few are implemented clinically, despite their many practical uses [
]. Commonly, this lack of uptake is attributed to reduced predictive performance when CPMs are validated in independent cohorts (e.g., external validation) [
]. Small sample sizes may result in extreme estimates of predictor effects (i.e., overfitting), subsequently resulting in poor predictive performance when applied to new patients. Although penalization and shrinkage methods (such as LASSO or ridge regression) are available to help with overfitting, these are not a solution to small sample size [
Historically, studies that develop CPMs have often justified their sample size based on events per predictor parameter (EPP), the ratio of the number of outcome events to the number of candidate predictor parameters, with an EPP of 10 often taken as a rule of thumb [
]. Riley et al. recently published a series of sample size formulae to calculate the minimum required sample size for binary, time-to-event, and continuous prediction models. Hereafter, these sample size criteria will be referred to as "Riley et al.", with the references being as follows: [
]. Indeed, compared with previous guidance around sample size requirements for prediction models, the Riley et al. criteria are tailored to the model (and clinical context) in question. For example, they are context-specific in terms of outcome incidence and model fit. As such, in this study, we take the Riley et al. criteria as the gold standard for sample size calculation.
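To make these concepts concrete, the short sketch below (in R, the language used for our analyses) shows how an EPP value, and the sample size implied by the EPP10 rule of thumb, can be computed. All input values are illustrative assumptions rather than data from any included study.

```r
# Illustrative only: how the EPP rule of thumb translates into a minimum
# sample size. The inputs below are assumed values, not from any study.
events     <- 120   # number of outcome events in the development data
parameters <- 25    # candidate predictor parameters (degrees of freedom)
n          <- 1000  # total development sample size

epp        <- events / parameters   # events per predictor parameter
event_rate <- events / n            # outcome proportion
n_epp10    <- ceiling(10 * parameters / event_rate)

epp      # 4.8: fails the EPP >= 10 rule of thumb
n_epp10  # 2084: sample size needed to reach EPP = 10 at this event rate
```

In contrast, the Riley et al. criteria additionally depend on the anticipated model fit and the outcome prevalence, which is why no single EPP threshold can reproduce them.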
However, it is unclear to what extent previously developed CPMs adhere to minimum required sample sizes as calculated by the Riley et al. criteria. Therefore, the aim of this study was to use prostate cancer as a clinical exemplar in which to retrospectively assess whether published multivariable CPMs adhere to the minimum sample size criteria as outlined by Riley et al. [
] and the level of agreement between these criteria and the EPP10 rule of thumb. We chose to focus on CPMs within prostate cancer because this is a common context in which CPMs are developed, in both diagnostic and prognostic prediction settings, largely owing to their many practical uses, including predicting disease onset and informing the management of active surveillance [
].
2. Methods
2.1 Search strategy
We used an existing, published search filter to identify prediction model studies [
]; this existing search filter was combined with terms specific to prostate cancer (see supplementary methods). We also searched the reference lists of any relevant systematic reviews discovered in our database search to identify additional CPM development studies for inclusion. The last search was conducted on June 30, 2019.
2.2 Eligibility criteria
We included any papers that developed a multivariable model/score/tool/algorithm (hereafter termed model) for predicting the individual risk of an outcome of interest in the context of prostate cancer. Because we were interested in models that output the risk of the outcome of interest, we only included prediction models based on logistic or Cox regression for binary and time-to-event outcomes, respectively [
]. To be included, the papers must have reported sufficient information to allow us to retrospectively calculate the Riley et al. minimum required sample size; any study that did not report sufficient information was excluded. We also excluded any papers that externally validated an existing model (without developing a new model) and those that aimed to update an existing risk model with a new predictor, although such papers were used to identify the paper that developed the original CPM. Furthermore, papers that developed CPMs for multiple anatomical sites were excluded, as were those that were developed within a competing risks/multistate modeling framework (because multiple outcome regression is currently not covered by the Riley et al. [
] criteria). Finally, we excluded any papers that were only available as an abstract (e.g., conference proceedings). We limited inclusion to papers published in English or those available with an English translation. The study selection process was documented using the PRISMA flow diagram [
].
2.3 Study selection
Initially, the titles and abstracts of identified papers were screened by the lead author (S.D.C.), cross-referencing against the inclusion and exclusion criteria. Papers satisfying the inclusion and exclusion criteria at this stage were then full-text-screened (which further excluded papers as required). Any uncertainty in whether to include/exclude a particular study was resolved through discussion and consensus with a second reviewer (G.P.M.).
2.4 Data extraction
Primary information that we extracted from identified papers included the following: the sample size used (for model development), the number of outcome events, the predictor parameters considered, and the reported C-statistic (to retrospectively calculate R2 as outlined by Riley et al. [
]). In addition, for time-to-event models, we extracted the mean follow-up and length of follow-up reported in the papers. Extraction of all these values enabled us to retrospectively calculate the Riley et al. minimum sample size criteria [
]. In addition, the EPP was retrospectively calculated using the reported number of events and number of predictors. For calculation of the Riley et al. criteria and EPP, we considered the total degrees of freedom for all variables (of all candidate predictors), where this was possible to determine; if the number of candidate predictors was not reported, then the number of parameters in the final model was used for the calculations. Finally, we noted if the study was reported in accordance with the TRIPOD guidelines (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) [
], and whether the study conducted any form of internal and/or external validation alongside the model development.
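To illustrate the conversion step above, the sketch below shows one simulation-based way to approximate a Cox-Snell R2 from a reported C-statistic and outcome prevalence, in the spirit of the approach described by Riley et al. It is not the code used in this study; the assumption of a normally distributed linear predictor, the function names, and all input values are ours, for illustration only.

```r
# Hedged sketch: back-calculate an approximate Cox-Snell R2 from a reported
# C-statistic and prevalence, assuming a normally distributed linear predictor.
simulate_fit <- function(sigma, prevalence, n_sim = 2e5, seed = 42) {
  set.seed(seed)
  lp <- rnorm(n_sim, mean = 0, sd = sigma)           # linear predictor
  alpha <- uniroot(function(a) mean(plogis(a + lp)) - prevalence,
                   c(-20, 20))$root                  # intercept matching prevalence
  p <- plogis(alpha + lp)
  y <- rbinom(n_sim, 1, p)
  # C-statistic via the rank (Mann-Whitney) formulation
  r  <- rank(p)
  n1 <- sum(y); n0 <- n_sim - n1
  c_stat <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  # Cox-Snell R2 from the model vs. null log-likelihoods
  ll1 <- sum(dbinom(y, 1, p, log = TRUE))
  ll0 <- sum(dbinom(y, 1, mean(y), log = TRUE))
  list(c_stat = c_stat, r2_cs = 1 - exp(2 * (ll0 - ll1) / n_sim))
}

r2_from_c <- function(c_target, prevalence) {
  # search for the linear-predictor SD that reproduces the reported C-statistic
  sigma <- uniroot(function(s) simulate_fit(s, prevalence)$c_stat - c_target,
                   c(0.01, 5))$root
  simulate_fit(sigma, prevalence)$r2_cs
}

r2_from_c(c_target = 0.75, prevalence = 0.3)  # roughly 0.2 with these settings
```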
2.5 Statistical analysis
Results were summarized using descriptive statistics. We used logistic regression to examine whether the odds of a paper surpassing the EPP10 rule of thumb or the Riley et al. sample size criteria changed over time. We used the chi-squared test to examine whether the number of papers surpassing the EPP10 rule of thumb and/or the Riley et al. sample size criteria differed by study type (development/internal/external validation) or by clinical task (i.e., intended prediction aim). All analyses were performed using R version 3.6.2 [
], which was also used to calculate the Riley et al. required sample size.
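For concreteness, the sketch below implements the three Riley et al. minimum sample size criteria for a binary-outcome model, following the published formulae. It is a minimal illustration rather than the analysis code used in this study, and the example inputs are assumed values.

```r
# Minimal illustration of the Riley et al. criteria for binary outcomes.
# Inputs are assumptions for illustration, not values from an included CPM.
min_sample_size_binary <- function(p, r2_cs, prevalence,
                                   shrinkage = 0.9, delta = 0.05) {
  # Criterion 1: expected shrinkage of predictor effects of at most 10%
  n1 <- p / ((shrinkage - 1) * log(1 - r2_cs / shrinkage))

  # Criterion 2: small absolute difference (delta) between apparent and
  # optimism-adjusted model fit, via the maximum possible Cox-Snell R2
  lnL0   <- prevalence * log(prevalence) + (1 - prevalence) * log(1 - prevalence)
  r2_max <- 1 - exp(2 * lnL0)
  s2     <- r2_cs / (r2_cs + delta * r2_max)
  n2     <- p / ((s2 - 1) * log(1 - r2_cs / s2))

  # Criterion 3: precise estimate of the overall outcome risk
  # (margin of error <= delta around the prevalence)
  n3 <- (1.96 / delta)^2 * prevalence * (1 - prevalence)

  ceiling(max(n1, n2, n3))
}

# Example: 20 candidate predictor parameters, an anticipated Cox-Snell R2 of
# 0.15, and an outcome prevalence of 0.3 (all assumed values)
min_sample_size_binary(p = 20, r2_cs = 0.15, prevalence = 0.3)  # 1097 here
```

Criterion 1 dominates for these inputs, mirroring our finding below that criterion 1 drove the required sample size for most included CPMs. In practice, implementations such as the pmsampsize R package provide these calculations, including the time-to-event and continuous cases.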
3. Results
3.1 Study selection and characteristics
The initial search identified 5,026 papers, with an additional 20 identified through citation searching of systematic reviews (Supplementary Table 1); of these, 2,628 papers remained following removal of duplicates. After the initial title/abstract screening, 305 papers underwent full-text screening, which identified 139 papers for inclusion. These papers resulted in 211 CPMs (because some papers developed more than one CPM). Fig. 1 shows the PRISMA flow diagram, and Supplementary Table 2 gives the full list of included papers.
Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram.
The included papers were published between 1994 and 2019. The intended use of the CPM and the modeling methods varied across the included studies (Table 1). Of the 211 models included in this review, 124 (59%) focused on diagnosis of prostate cancer, 9 (4%) on predicting side effects, 44 (21%) on risk of progression/recurrence, and 34 (16%) on survival/mortality predictions. Overall, 143 CPMs were developed using logistic regression for binary outcomes, and 68 using a Cox proportional hazards model for time-to-event outcomes.
Table 1. Distribution of modeling techniques across prediction aim

Prediction aim               Binary   Time to event   Total
All
  Development                    26              17      43
  + Internal validation          85              31     116
  + External validation          32              20      52
  Total                         143              68     211
Diagnosis
  Development                    20               1      21
  + Internal validation          74               1      75
  + External validation          27               1      28
  Total                         121               3     124
Side effects
  Development                     3               -       3
  + Internal validation           2               1       3
  + External validation           3               -       3
  Total                           8               1       9
Progression/recurrence
  Development                     3               8      11
  + Internal validation           7              16      23
  + External validation           2               8      10
  Total                          12              32      44
Survival/mortality
  Development                     -               8       8
  + Internal validation           2              13      15
  + External validation           -              11      11
  Total                           2              32      34
Note that 139 studies met the inclusion criteria, contributing 211 models in total.
Internal validation was defined as any appropriate method that was used to adjust predictive performance for in-sample optimism, such as bootstrap resampling. External validation was defined as any paper that included an independent data set from a distinct population, such as geographical validation.
As shown in Table 1, 43 (20%) of the models detailed development only, 116 (55%) also incorporated internal validation (i.e., adjusted performance for in-sample optimism) and 52 (25%) included development and external validation (i.e., validation in an independent data set).
3.2 Adherence to sample size requirements: overall
Of the 139 included studies, 34 (24%) acknowledged limitations of the sample size used to develop their proposed model(s), but only three studies outlined how they calculated their minimum required sample size. Of these three studies, the first used EPP10 [
], and the third based their sample size calculation on achieving “92% power [for] a noninferior sensitivity and superior specificity” to a previously developed model [
Supplementary Table 2 shows which of the included papers satisfied the traditional EPP10 rule of thumb and which satisfied the Riley et al. sample size criteria. Overall, 102 of the included CPMs (48%) were developed on sample sizes that surpassed the Riley et al. criteria, and 145 (69%) exceeded the traditional 10 EPP rule of thumb. Fewer than half of the models (47%) satisfied both EPP10 and the Riley et al. criteria (Fig. 2), and 64 models met neither criterion.
Fig. 2 A Venn diagram of the included models that surpass or fail to meet the events per predictor parameter (EPP) 10 rule of thumb and the Riley et al. criteria. A total of 64 models met neither criterion.
Across the CPMs that satisfied the Riley et al. sample size criteria, there was large variability in their EPP (Supplementary Fig. 1). For most CPMs, the Riley et al. criteria resulted in a higher required sample size than that based on EPP10 (Fig. 3); only 38 (18%) of the included CPMs required a lower minimum sample size to satisfy the Riley et al. criteria than to surpass EPP10. The calculated sample size from the Riley et al. formulae was driven by criterion 1 (small optimism in predictor effect estimates) in 141 CPMs, criterion 2 (small difference in apparent and adjusted model fit) in 5 CPMs, and criterion 3 (precise estimation of overall risk) in 65 CPMs [
Fig. 3 A scatter plot of the sample size that would be needed to satisfy the events per predictor parameter (EPP) 10 rule of thumb against the required sample size based on the Riley et al. criteria. Both axes are shown on the log scale to aid visualization.
There was no evidence that the proportion of papers surpassing the Riley et al. sample size criteria changed through time (P = 0.323, Fig. 4). We found that study type (development only/+internal/+external validation) was significantly associated with pass rate (P < 0.001). Clinical task was also associated with pass rate (P = 0.005), suggesting differing pass rates between diagnostic models, risk of progression/recurrence models, side effects models, and survival models.
Fig. 4 The number of clinical prediction models (CPMs) related to prostate cancer that have been published, and the number of those that satisfy the Riley et al. sample size criteria.
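To show the form of this trend analysis, a minimal sketch follows; the data frame is fabricated solely to illustrate the model specification and is not the study data.

```r
# Illustrative only: logistic regression of whether a CPM surpassed the
# Riley et al. criteria (pass = 1) on publication year. Fabricated data.
dat <- data.frame(
  year = c(2005, 2008, 2010, 2011, 2013, 2014, 2016, 2017, 2018, 2019),
  pass = c(0,    0,    0,    1,    1,    0,    0,    1,    1,    1)
)
fit <- glm(pass ~ year, family = binomial, data = dat)
summary(fit)$coefficients["year", ]  # log-odds change per year, SE, z, P-value
```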
3.3 Adherence to sample size requirements: binary outcomes
Most of the included CPMs were developed for binary outcomes (143 models), of which 71% satisfied the 10 EPP rule of thumb and 51% satisfied the Riley et al. sample size criteria (Supplementary Table 2). Only 24 (17%) models had a lower required minimum sample size under the Riley et al. criteria than under the 10 EPP rule of thumb.
3.4 Adherence to sample size requirements: time-to-event outcomes
Of the 68 included CPMs developed for time-to-event outcomes, 65% satisfied the 10 EPP rule of thumb and 43% satisfied the Riley et al. criteria (Supplementary Table 2). Only 14 (21%) of the time-to-event CPMs had a lower required sample size under the Riley et al. criteria than under the 10 EPP rule of thumb, again showing that the Riley et al. criteria are typically stricter.
3.5 Quality assessment
Of all included studies, 46 were published after the publication of the TRIPOD guidelines [
]. Of these, only three studies reported their models in accordance with the TRIPOD guidelines [
]; these three studies contributed six models in total, and all six were based on Cox proportional hazards models.
4. Discussion
This systematic review has important implications for the literature. First, it shows that sample sizes are rarely justified in this field. Second, it highlights that the classic use of EPP≥10 is not necessarily sufficient to guide sample size for prediction model development based on formal criteria. Third, regardless of whether a sample size calculation has been used, our recommendation is that a justification for sample size should be included in all prediction model studies going forward (as a minimum). Finally, our study highlights the extent to which this situation has previously been an issue, and so serves as a benchmark for comparison in future reviews (which should aim to examine whether improvements have been made). To be clear, the intention of this paper is not to apportion blame to previous studies in terms of sample size and its justification, but rather to highlight to readers that this topic is an important and outstanding issue in the reporting of prediction model studies.
This study found over 200 published CPMs for risk prediction in prostate cancer, but only three justified the sample size used, mostly based on EPP rules of thumb. We were not able to determine the precise reasons why studies might not justify sample size. One potential reason is that some studies included in this review were published before the TRIPOD guidelines (which include an item to report how the sample size was arrived at). In addition, we postulate that this also reflects the historic lack of formal guidance around required sample size in CPM development studies, and the blanket use of EPP rules of thumb [
]. Indeed, several systematic reviews have observed that prediction model studies, both development and validation, frequently provide no rationale for the sample size used, nor discuss the potential for overfitting [
]. The Riley et al. criteria now provide the required mechanism to allow such sample size justification. It is also possible that researchers with access to a data set of fixed sample size see no need to justify the sample size; here, the Riley et al. criteria should instead be used to determine the maximum number of candidate predictor variables that the fixed sample size can support.
We acknowledge that all of the CPMs considered in this review were published before the Riley et al. sample size criteria [
] (which were published in 2019); consequently, one cannot expect historically derived CPMs to adhere to these criteria (and, again, we do not intend to apportion blame). Nonetheless, our findings highlight the importance of carefully justifying the required sample size in all future CPM development studies. The TRIPOD guidelines include the requirement to report how the study sample size was arrived at [
]; our findings suggest that these guidelines should be extended to include a requirement for papers to formally justify the sample size (e.g., using the Riley et al. criteria). To use the Riley et al. criteria in practice, one needs to specify the anticipated model fit (i.e., R2) in advance of data collection/model fitting [
]. In this study, we based the sample size calculation on the performance of the fitted model; thus, it could be that—in the future—modelers adhere to the Riley et al. guidance but are too optimistic in the value of R2 that they choose. In such a case, it would retrospectively appear that the study failed to meet the criteria. Therefore, we suggest that all steps of the Riley et al. calculation should be reported in development papers.
Importantly, we found that only 48% of included CPMs surpassed the Riley et al. sample size criteria [
]. Indeed, this study found that there was large spread in the EPP for studies that satisfied the Riley et al. criteria (Supplementary Fig. 1), demonstrating that a single threshold value for EPP is insufficient; this supports existing research in this area [
]. Moreover, for most CPMs the Riley et al. criteria resulted in higher required sample sizes than those needed to meet an EPP of 10, with the former usually being driven by the need to minimize overfitting (i.e., criterion 1 of Riley et al. [
]). In other words, even if a particular study met the 10 EPP rule of thumb, this does not guarantee that it meets the Riley et al. criteria, which are often more stringent. This finding reinforces that EPP alone is not suitable to guide sample size for CPM development.
In addition, we found that studies including internal/external validation alongside model development had higher odds of surpassing the Riley et al. criteria than studies that only included CPM derivation. This finding suggests that sample size improvements are linked with improved methodology more generally within CPM studies. Furthermore, it is currently unknown whether adhering to formal sample size criteria leads to greater generalizability of predictive performance [
]. Lack of generalizability of CPMs might lead to individual institutes/centers/research groups developing de novo models on local data, and this could further compound the issues with low sample size. Indeed, lack of sufficient sample size in local settings increases the risk of overfitting. Further research is required to explore both of these points in more detail.
Similarly, further research is needed around sample size requirements for modeling methods such as ordinal, multinomial, competing risks/multistate models and machine learning methods. Indeed, while the sample size criteria outlined by Riley et al. [
] provide guidance for logistic, Cox, and linear regression models, this study identified other modeling techniques being used to develop CPMs in prostate cancer, which we excluded owing to the lack of sample size formulae. For example, given the rise in popularity of machine learning techniques, guidance around sample size for such methods has become more pertinent [
Several limitations should be considered when interpreting the results of this study. First, the search was completed in June 2019. Prompted by a reviewer comment, we examined a random sample of 10 articles published between June 2019 and November 2020 (the time of the first reviewer comments) and found very similar conclusions; for example, 53% surpassed the Riley et al. sample size criteria, similar to the 48% in the main analysis. Second, this analysis was retrospective, meaning that most of the included studies were published before the Riley et al. sample size criteria. Future work should explore whether adherence increases over time. Third, 37 studies were excluded because insufficient information was reported to calculate the Riley et al. sample size criteria. Such exclusion potentially introduces bias into our findings if those papers differ from the included papers in terms of their reported sample size relative to the required sample size. Fourth, this review focused on prediction models in the prostate cancer domain; such a domain-specific focus was required because of the large number of CPMs published across medical domains. As such, one could argue that the findings might not generalize to other clinical areas.
In conclusion, historically few CPM development studies have justified their choice of sample size, with any justification usually being based on EPP. This systematic review has highlighted that the classic EPP10 rule of thumb, although the most commonly used metric, is not necessarily sufficient to guide sample size for CPM development. The findings also show that there is a need to drastically improve the justification of sample sizes used in prediction model studies; justification for how the sample size was arrived at should be the bare minimum, regardless of how the data were collected (e.g., by conducting a new cohort study, or by using an existing data set). The Riley et al. criteria now provide the required means of doing this. Although a historic lack of justification is perhaps unsurprising (given that the Riley et al. criteria have only recently been published), future work should monitor whether this situation improves.
Acknowledgments
Authors' contributions: Authors G.P.M. and N.P. were responsible for the conceptualization of the research. Author S.D.C. was responsible for undertaking the systematic review, collecting all data therefrom and performing the statistical analysis. G.P.M., N.P., and R.D.R. supervised the project. S.D.C. and G.P.M. wrote the original draft of the paper, whereas all authors (S.D.C., N.P., R.D.R., and G.P.M.) revised and edited the paper critically for scientific content.
References
The management of active surveillance in prostate cancer: validation of the Canary Prostate Active Surveillance Study risk calculator with the Spanish Urological Association Registry.
The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration.
Genomic classifier augments the role of pathological features in identifying optimal candidates for adjuvant radiation therapy in patients with prostate cancer: development and internal validation of a multivariable prognostic model.
Development and internal validation of prediction models for biochemical failure and composite failure after focal salvage high intensity focused ultrasound for local radiorecurrent prostate cancer: presentation of risk scores for individual patient prognoses.
Multivariable model development and internal validation for prostate cancer specific survival and overall survival after whole-gland salvage Iodine-125 prostate brachytherapy.