Highlights
- After a clinical prediction model is developed, it is usually necessary to undertake an external validation study that examines the model's performance in new data from the same or a different population. External validation studies should have an appropriate sample size, in order to estimate model performance measures precisely for calibration, discrimination and clinical utility.
- Rules-of-thumb suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting.
- Our work shows that the precision of performance estimates is affected by the model's linear predictor (LP) distribution, in addition to the number of events and total sample size. Furthermore, sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration.
- Our new proposal uses a simulation-based sample size calculation, which accounts for the LP distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
- The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc.), the model's anticipated LP distribution in the validation population, and whether or not the model is expected to be well calibrated. Guidance on how to specify these values is given, and R and Stata code is provided.
Abstract
Introduction
Methods
Results
Conclusion
Key words
Key findings
- Existing rules-of-thumb for the sample size required in external validation studies of prediction models for binary outcomes, such as having 100 events and 100 non-events, may not ensure precise performance estimates, particularly for calibration measures.
- Precision of performance estimates is affected by the model's linear predictor distribution, in addition to the number of events and total sample size.
What this adds to what is known
- Our simulation study shows that more than 200 events and non-events are often needed to achieve precise estimates of calibration, and the actual sample size calculation should be tailored to the setting and model of interest.
- Our new proposal uses a simulation-based sample size calculation, which accounts for the linear predictor distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
What is the implication, what should change now
- Precise performance estimates should be targeted when externally validating prediction models for binary outcomes, and this can be done through simulation. The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc.), the model's anticipated linear predictor distribution in the validation population, and whether or not the model is expected to be well calibrated.
1. Introduction
2. Methods
2.1 Predictive performance measures and a motivating example
logit(pi) = ln(pi/(1 − pi)) = α + β1X1i + β2X2i + … + βkXki
where pi is the predicted probability of the outcome for individual i, α is the intercept, and the X and β terms represent the observed predictor values and predictor effects (log odds ratios), respectively. The right-hand side of the equation is often referred to as the linear predictor (LP). The predictive performance of a model is usually evaluated by estimating measures of calibration, discrimination and clinical utility, as defined in Box 1.
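As a minimal worked illustration of this equation (the intercept, coefficients and predictor values below are hypothetical, not taken from any real model):

```r
# Hypothetical logistic model with two predictors; alpha and beta values are made up
alpha <- -2.3
beta  <- c(age = 0.04, smoker = 0.80)   # log odds ratios

x  <- c(age = 65, smoker = 1)           # one individual's predictor values
lp <- alpha + sum(beta * x)             # linear predictor (log-odds scale)
p  <- 1 / (1 + exp(-lp))                # predicted probability via the inverse logit
p                                       # approximately 0.75 for these values
```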
Box 1. Measures of calibration, discrimination and clinical utility.
2.2 Simulation study to investigate factors that influence the precision of performance estimates
2.2.1 Scenarios for the simulation study

2.2.2 Main simulation process
- 1) Define the simulation scenario by specifying σ and μ, with the latter corresponding to the "base probability" of the outcome (p = 1/(1+exp(-μ))). Also specify the desired expected number of events (E) in a population where all individuals have the base probability.
- 2) Set the validation dataset's sample size (N) as E divided by the base probability.
- 3) Generate LP values for each patient in the dataset using LPi ~ Normal(μ, σ²).
- 4) Generate binary outcomes (Yi = 0 or 1) for each patient using Yi ~ Bernoulli(1/(1+exp(-LPi))).
- 5) Estimate the model's calibration and discrimination performance, with 95% confidence intervals (CIs), in this validation dataset.
- 6) Repeat steps 3 to 5 a total of 500 times for each simulation scenario. 500 repetitions were used to ensure a small Monte Carlo error whilst keeping computation time acceptable.
- 7) For each performance measure, calculate the average estimate and the average precision (based on the average 95% CI width) across the 500 results (an illustrative sketch of these steps is given below).
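A minimal R sketch of these steps for a single scenario is given below. The values of μ, σ and E are illustrative only, and the Wald interval for the calibration slope and the Hanley-McNeil interval for the C-statistic are simply convenient choices here, not necessarily those used in the paper.

```r
# Illustrative simulation for one scenario (steps 1-7); parameter values are examples
set.seed(2021)
mu     <- -2.2                     # step 1: logit of the base probability
sigma  <- 0.6                      # step 1: SD of the linear predictor (LP)
E      <- 100                      # step 1: desired expected number of events
base_p <- 1 / (1 + exp(-mu))
N      <- round(E / base_p)        # step 2: validation sample size

results <- replicate(500, {        # step 6: 500 repetitions
  lp <- rnorm(N, mean = mu, sd = sigma)      # step 3: LP values
  y  <- rbinom(N, 1, 1 / (1 + exp(-lp)))     # step 4: binary outcomes

  # step 5: calibration slope (logistic recalibration) with Wald 95% CI
  fit   <- glm(y ~ lp, family = binomial)
  slope <- unname(coef(fit)["lp"])
  slope_width <- 2 * 1.96 * sqrt(vcov(fit)["lp", "lp"])

  # step 5: C-statistic via the Mann-Whitney statistic, with Hanley-McNeil 95% CI
  n1 <- sum(y); n0 <- N - n1
  cstat <- (sum(rank(lp)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  q1 <- cstat / (2 - cstat); q2 <- 2 * cstat^2 / (1 + cstat)
  se <- sqrt((cstat * (1 - cstat) + (n1 - 1) * (q1 - cstat^2) +
                (n0 - 1) * (q2 - cstat^2)) / (n1 * n0))
  c(slope = slope, slope_ci_width = slope_width,
    cstat = cstat, cstat_ci_width = 2 * 1.96 * se)
})

rowMeans(results)   # step 7: average estimates and average 95% CI widths
```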
2.2.3 Extensions to miscalibration
2.3 Proposal for simulation-based sample size calculations
2.3.1 Applied example: Diagnostic model for deep vein thrombosis
3. Results
3.1 Factors associated with the precision of model performance estimates: results from simulation study
3.1.1 Precision of the estimated C-statistic
Factor | Values |
---|---|
Standard deviation of the LPi (σ) | 0.2, 0.4, 0.6, 0.8, 1.0 |
Base probability (inverse logit(μ)) | 0.05, 0.1, 0.2, 0.3, 0.4, 0.5 |
Expected number of events (E) | 50, 100, 150, 200, …, 800 |
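The full factorial combination of these values gives 5 × 6 × 16 = 480 scenarios, which can be enumerated as in the sketch below (object and column names are illustrative):

```r
# All combinations of the simulation factors listed in the table above
scenarios <- expand.grid(
  sigma  = c(0.2, 0.4, 0.6, 0.8, 1.0),         # SD of the LP
  base_p = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5),   # inverse logit of mu
  E      = seq(50, 800, by = 50)               # expected number of events
)
scenarios$mu <- log(scenarios$base_p / (1 - scenarios$base_p))  # logit of the base probability
scenarios$N  <- round(scenarios$E / scenarios$base_p)           # validation sample size
nrow(scenarios)   # 480 scenarios
```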

3.1.2 Precision of the estimated calibration slope

3.1.3 Precision of the estimated calibration-in-the-large and O/E statistic

3.1.4 Extensions to scenarios with miscalibration
3.2 Application of simulation-based sample size calculation to go beyond current rules-of-thumb
- a. The model is validated in the same population as the development cohort, and the model is expected to be well calibrated (γ = 0 and S = 1 in Eq. 2).
- b. The model is validated in the same population as the development cohort, but the model is expected to be miscalibrated (e.g., due to overfitting) (γ = 0 and S = 0.9 in Eq. 2); a sketch of how such miscalibration can be induced in simulated data is given after this list.
- c. The outcome event proportion differs from that in the development data, either due to different case-mix or miscalibration of the model.
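Equation 2 is not reproduced above; assuming it takes the standard recalibration form logit(true risk) = γ + S × LP, miscalibrated validation data can be generated as in the following sketch (parameter values are illustrative):

```r
# Sketch: generating validation outcomes under miscalibration, assuming Eq. 2
# has the recalibration form logit(true risk) = gamma + S * LP
set.seed(2021)
gamma <- 0      # calibration-in-the-large term
S     <- 0.9    # calibration slope (< 1 mimics overfitting at model development)
N     <- 2000
mu    <- -2.2; sigma <- 0.6                    # illustrative LP distribution

lp     <- rnorm(N, mu, sigma)                  # LP values from the model being validated
true_p <- 1 / (1 + exp(-(gamma + S * lp)))     # true risks under miscalibration
y      <- rbinom(N, 1, true_p)                 # observed outcomes

# The estimated calibration slope should now be close to S (here 0.9)
coef(glm(y ~ lp, family = binomial))["lp"]
```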
3.2.1 Validation in the same population with good calibration
Performance measure | Mean of the 1,000 estimates (N = 461; ~100 events on average per validation dataset) | Average width of the 1,000 95% CIs (N = 461) | Mean of the 1,000 estimates (N = 922; ~200 events on average per validation dataset) | Average width of the 1,000 95% CIs (N = 922)
---|---|---|---|---
C-statistic | 0.817 | 0.09 | 0.816 | 0.06 |
Calibration slope | 1.016 | 0.46 | 1.008 | 0.33 |
Observed/expected | 1.000 | 0.35 | 1.002 | 0.25 |
Integrated calibration index | 0.020 | 0.04 | 0.014 | 0.03 |
Net benefit at a risk threshold of 0.1 | 0.153 | 0.08 | 0.154 | 0.06 |
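The final row of the table reports net benefit at a risk threshold of 0.1. As a reminder, net benefit at threshold t counts individuals with predicted risk at or above t as test-positive and weights false positives by t/(1 − t); a minimal sketch (with an illustrative function name) is:

```r
# Net benefit of using the model at risk threshold t (here t = 0.1)
net_benefit <- function(y, predicted_risk, t = 0.1) {
  treat <- predicted_risk >= t     # classify as "treat" if predicted risk >= threshold
  tp <- sum(treat & y == 1)        # true positives
  fp <- sum(treat & y == 0)        # false positives
  (tp - fp * t / (1 - t)) / length(y)
}
```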
Performance measure | Targeted 95% CI width | Sample size (events) required to achieve the targeted CI width
---|---|---|
C-statistic | 0.1 | 385 (85) |
Calibration slope | 0.2 | 2430 (531) |
Ln(observed/expected) | 0.2 | 1379 (302) |
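In practice such a calculation can be framed as a search over candidate sample sizes, increasing N until the simulated average 95% CI width for the targeted measure falls below the desired value. The sketch below does this for the calibration slope under the assumption of good calibration; it is an illustrative implementation rather than the authors' published R or Stata code, and the LP distribution parameters are placeholders rather than those of the DVT model.

```r
# Illustrative search for the sample size giving a target average 95% CI width
# for the calibration slope (assumes a well-calibrated model and a Normal LP)
avg_ci_width <- function(N, mu, sigma, n_sim = 200) {
  widths <- replicate(n_sim, {
    lp  <- rnorm(N, mu, sigma)
    y   <- rbinom(N, 1, 1 / (1 + exp(-lp)))
    fit <- glm(y ~ lp, family = binomial)
    2 * 1.96 * sqrt(vcov(fit)["lp", "lp"])   # Wald 95% CI width for the slope
  })
  mean(widths)
}

find_n <- function(target_width, mu, sigma, start = 500, step = 250, max_n = 20000) {
  N <- start
  while (N <= max_n && avg_ci_width(N, mu, sigma) > target_width) N <- N + step
  N
}

# Example: target a 95% CI width of 0.2 for the calibration slope
# (mu and sigma are illustrative, not the DVT model's actual LP distribution)
set.seed(2021)
find_n(target_width = 0.2, mu = -1.5, sigma = 1.0)
```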
3.2.2 Validation in the same population but assuming miscalibration
3.2.3 Validation in a different population with a different case-mix or event proportion
4. Discussion
Appendix. Supplementary materials
References
Article info
Publication history
Footnotes
Funding: KIES is funded by a National Institute for Health Research School for Primary Care Research (NIHR SPCR) launching fellowship. RDR and LA are supported by funding from the Evidence Synthesis Working Group, which is funded by the National Institute for Health Research School for Primary Care Research (NIHR SPCR) [Project Number 390]. TD is supported by the Netherlands Organisation for Health Research and Development (grant number: 91617050). LJB was supported by a Post-Doctoral Fellowship (Dr Laura Bonnett - PDF-2015-08-044) from the National Institute for Health Research. BP is supported by an NIHR Post-Doctoral Fellowship (PDF 2014-10872). GC is supported by Cancer Research UK (programme grant: C49297/A27294) and the NIHR Biomedical Research Centre, Oxford. This publication presents independent research funded by the National Institute for Health Research (NIHR). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Conflicts of interest: The authors have no conflicts of interest.
Author contributions: KS and RR developed the research idea, which stemmed from discussions on sample size with all authors, building on applied prediction model projects (BP, TD), training courses (GC, JE, LB, LA), and PhD topics (LA). KS undertook the simulation study and developed the code for the simulation-based sample size calculation, with support from JE and RR. KS applied the approaches to the applied examples, with support from TD and RR. KS drafted the article, which was then revised by KS and RR following comments and suggestions from all authors. All authors contributed to further revisions and responses to reviewers.
Identification
Copyright
User license
Creative Commons Attribution (CC BY 4.0)