Highlights
 • After a clinical prediction model is developed, it is usually necessary to undertake an external validation study that examines the model's performance in new data from the same or a different population. External validation studies should have an appropriate sample size, so that model performance measures for calibration, discrimination and clinical utility are estimated precisely.
 • Rules-of-thumb suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting.
 • Our work shows that the precision of performance estimates is affected by the model's linear predictor (LP) distribution, in addition to the number of events and total sample size. Furthermore, sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration.
 • Our new proposal uses a simulation-based sample size calculation, which accounts for the LP distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
 • The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc.), the model's anticipated LP distribution in the validation population, and whether or not the model is well calibrated. Guidance on how to specify these values is given, and R and Stata code is provided.
Abstract
Introduction
Methods
Results
Conclusion
Key words
Key findings
 • Existing rules-of-thumb, such as having 100 events and 100 non-events, for the sample size required for external validation studies for prediction models of binary outcomes may not ensure precise performance estimates, particularly for calibration measures.
 • Precision of performance estimates is affected by the model's linear predictor distribution, in addition to the number of events and total sample size.
What this adds to what is known
 • Our simulation study shows that more than 200 events and non-events are often needed to achieve precise estimates of calibration, and the actual sample size calculation should be tailored to the setting and model of interest.
 • Our new proposal uses a simulation-based sample size calculation, which accounts for the linear predictor distribution and (mis)calibration in the validation sample, and calculates the sample size (and events) required conditional on these factors.
What is the implication, what should change now
 • Precise performance estimates should be targeted when externally validating prediction models for binary outcomes, and this can be done through simulation. The approach requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc.), the model's anticipated linear predictor distribution in the validation population, and whether or not the model is expected to be well calibrated.
1. Introduction
2. Methods
2.1 Predictive performance measures and a motivating example
A prediction model for a binary outcome typically takes the logistic form $\ln\left(\frac{{p}_{i}}{1-{p}_{i}}\right)=\alpha +{\beta }_{1}{X}_{1i}+{\beta }_{2}{X}_{2i}+\cdots$, where ${p}_{i}$ is the predicted probability of the outcome for individual i, α is the intercept, and the X and β terms represent the observed predictor values and predictor effects (log odds ratios), respectively. The right-hand side of the equation is often referred to as the linear predictor (LP). The predictive performance of a model is usually evaluated by estimating measures of calibration, discrimination and clinical utility, as defined in Box 1.
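Since the calculations that follow repeatedly convert between the LP (log-odds) scale and the probability scale, a small helper is useful. The paper itself provides R and Stata code; the sketch below is a minimal Python equivalent of the inverse logit, with a function name of my own choosing.

```python
import math

def inv_logit(lp):
    """Convert a linear predictor (log-odds) into a predicted probability:
    p = 1 / (1 + exp(-LP))."""
    return 1.0 / (1.0 + math.exp(-lp))

# LP = 0 corresponds to a predicted risk of 0.5;
# each unit increase in LP multiplies the odds of the outcome by e.
print(inv_logit(0.0))  # 0.5
```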
Calibration

2.2 Simulation study to investigate factors that influence the precision of performance estimates
2.2.1 Scenarios for the simulation study
2.2.2 Main simulation process
 1) Define the simulation scenario by specifying σ and μ, with the latter corresponding to the "base probability" of the outcome (p = 1/(1 + exp(−μ))). Also specify the desired expected number of events (E) in a population where all individuals have the base probability.
 2) Set the validation dataset's sample size (N) by dividing E by the base probability.
 3) Generate LP values for each patient in the dataset using LPi ~ Normal(μ, σ²).
 4) Generate binary outcomes (Yi = 0 or 1) for each patient by Yi ~ Bernoulli(1/(1 + exp(−LPi))).
 5) Estimate, with 95% confidence intervals (CIs), the model's calibration and discrimination performance using the external validation dataset.
 6) Repeat steps 3 to 5 a total of 500 times for each simulation scenario. 500 repetitions were used to ensure a small Monte Carlo error whilst keeping computation time acceptable.
 7) For each performance measure, calculate the average estimate and the average precision (based on the average 95% CI width) across the 500 results.
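The steps above can be sketched end-to-end. The following Python illustration is my own (the paper provides R and Stata code) and focuses on a single performance measure, the C-statistic, estimated by pairwise concordance; since the text does not specify the CI method, the Hanley-McNeil (1982) variance approximation is used as a stand-in, and all function names are assumptions.

```python
import math
import random
from statistics import mean

def inv_logit(lp):
    return 1.0 / (1.0 + math.exp(-lp))

def c_statistic(lp, y):
    # Proportion of (event, non-event) pairs where the event has the higher LP
    ev = [l for l, yi in zip(lp, y) if yi == 1]
    ne = [l for l, yi in zip(lp, y) if yi == 0]
    conc = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e in ev for n in ne)
    return conc / (len(ev) * len(ne))

def auc_ci_width(auc, n1, n0):
    # 95% CI width via the Hanley-McNeil (1982) variance approximation
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc) + (n1 - 1) * (q1 - auc ** 2)
           + (n0 - 1) * (q2 - auc ** 2)) / (n1 * n0)
    return 2 * 1.96 * math.sqrt(var)

def simulate_precision(mu, sigma, expected_events, reps=500, seed=2024):
    rng = random.Random(seed)
    base_prob = inv_logit(mu)                    # step 1: base probability
    n = round(expected_events / base_prob)       # step 2: total sample size N
    aucs, widths = [], []
    for _ in range(reps):                        # step 6: repeat
        lp = [rng.gauss(mu, sigma) for _ in range(n)]              # step 3
        y = [1 if rng.random() < inv_logit(l) else 0 for l in lp]  # step 4
        n1 = sum(y)
        if n1 in (0, n):
            continue                             # degenerate dataset; skip
        auc = c_statistic(lp, y)                 # step 5: estimate + CI width
        aucs.append(auc)
        widths.append(auc_ci_width(auc, n1, n - n1))
    return mean(aucs), mean(widths)              # step 7: average over reps
```

For example, `simulate_precision(mu=-2.2, sigma=0.8, expected_events=100)` corresponds to a scenario with base probability near 0.1; larger σ widens the LP spread and raises the average C-statistic.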
2.2.3 Extensions to miscalibration
2.3 Proposal for simulationbased sample size calculations

2.3.1 Applied example: Diagnostic model for deep vein thrombosis
3. Results
3.1 Factors associated with the precision of model performance estimates: results from simulation study
3.1.1 Precision of the estimated Cstatistic
Factor | Values
Standard deviation of the LPi (σ) | 0.2, 0.4, 0.6, 0.8, 1.0
Base probability (inverse logit of μ) | 0.05, 0.1, 0.2, 0.3, 0.4, 0.5
Expected number of events (E) | 50, 100, 150, 200, …, 800
3.1.2 Precision of the estimated calibration slope
3.1.3 Precision of the estimated calibration-in-the-large and O/E statistic
3.1.4 Extensions to scenarios with miscalibration
3.2 Application of simulation-based sample size calculation to go beyond current rules-of-thumb
 a. The model is validated in the same population as the development cohort, and the model is expected to be well calibrated (γ = 0 and S = 1 in Eq. 2).
 b. The model is validated in the same population as the development cohort, but the model is expected to be miscalibrated (e.g., due to overfitting) (γ = 0 and S = 0.9 in Eq. 2).
 c. The outcome event proportion differs from that in the development data, either due to different case-mix or miscalibration of the model.
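For scenarios (b) and (c), outcomes must be generated from a miscalibrated version of the model. Assuming Eq. 2 has the usual recalibration form logit(p) = γ + S × LP, with γ the calibration intercept and S the calibration slope, a minimal Python sketch (function name my own) is:

```python
import math

def event_probability(lp, gamma=0.0, slope=1.0):
    """Outcome-generating probability under (mis)calibration:
    logit(p) = gamma + slope * LP.  gamma = 0 and slope = 1 reproduce
    perfect calibration; slope < 1 mimics overfitting (scenario b)."""
    return 1.0 / (1.0 + math.exp(-(gamma + slope * lp)))

# Overfitted model (S = 0.9): extreme predictions are pulled toward the mean
print(event_probability(2.0, gamma=0.0, slope=0.9))  # ~0.858, vs ~0.881 at S = 1
```

Shifting γ changes the overall event proportion (scenario c), while S ≠ 1 distorts the spread of risks relative to the model's predictions.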
3.2.1 Validation in the same population with good calibration
Performance measure | Mean of 1,000 estimates (N = 461, ~100 events per dataset) | Average width of 1,000 95% CIs (N = 461) | Mean of 1,000 estimates (N = 922, ~200 events per dataset) | Average width of 1,000 95% CIs (N = 922)
C-statistic | 0.817 | 0.09 | 0.816 | 0.06
Calibration slope | 1.016 | 0.46 | 1.008 | 0.33
Observed/expected | 1.000 | 0.35 | 1.002 | 0.25
Integrated calibration index | 0.020 | 0.04 | 0.014 | 0.03
Net benefit at a risk threshold of 0.1 | 0.153 | 0.08 | 0.154 | 0.06
Performance measure | Targeted 95% CI width | Sample size (events) required to achieve CI width
C-statistic | 0.1 | 385 (85)
Calibration slope | 0.2 | 2430 (531)
ln(observed/expected) | 0.2 | 1379 (302)
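One of these targets admits a quick closed-form cross-check. If the number of observed events O is binomial with event proportion φ, the delta method gives var(ln(O/E)) ≈ (1 − φ)/(Nφ), so the N achieving a targeted 95% CI width w is roughly (2 × 1.96)²(1 − φ)/(φw²). The sketch below is my own illustration, taking φ ≈ 302/1379 ≈ 0.22 as implied by the table; it lands in the same ballpark as the simulation-based answer.

```python
import math

def n_for_ln_oe_width(phi, target_width):
    """Approximate N so that the 95% CI for ln(O/E) has the target width,
    using the delta-method variance var(ln(O/E)) ~= (1 - phi)/(N * phi)."""
    return math.ceil((2 * 1.96) ** 2 * (1 - phi) / (phi * target_width ** 2))

print(n_for_ln_oe_width(0.22, 0.2))  # close to the 1379 obtained by simulation
```

No comparable closed form is available for the calibration slope, which is one motivation for the simulation-based approach.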
3.2.2 Validation in the same population but assuming miscalibration
3.2.3 Validation in a different population with a different case-mix or event proportion
4. Discussion
Appendix. Supplementary materials
References
Article info
Publication history
Footnotes
Funding: KIES is funded by a National Institute for Health Research School for Primary Care Research (NIHR SPCR) launching fellowship. RDR and LA are supported by funding from the Evidence Synthesis Working Group, which is funded by the National Institute for Health Research School for Primary Care Research (NIHR SPCR) [Project Number 390]. TD is supported by the Netherlands Organisation for Health Research and Development (grant number: 91617050). LJB was supported by a Post-Doctoral Fellowship (Dr Laura Bonnett, PDF-2015-08-044) from the National Institute for Health Research. BP is supported by an NIHR Post-Doctoral Fellowship (PDF-2014-10-872). GC is supported by Cancer Research UK (programme grant: C49297/A27294) and the NIHR Biomedical Research Centre, Oxford. This publication presents independent research funded by the National Institute for Health Research (NIHR). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Conflicts of interest: The authors have no conflicts of interest.
Author contributions: KS and RR developed the research idea, which stemmed from discussions on sample size with all authors, building on applied prediction model projects (BP, TD), training courses (GC, JE, LB, LA), and PhD topics (LA). KS undertook the simulation study and developed the code for the simulationbased sample size calculation, with support from JE and RR. KS applied the approaches to the applied examples, with support from TD and RR. KS drafted the article, which was then revised by KS and RR following comments and suggestions from all authors. All authors contributed to further revisions and responses to reviewers.
Identification
Copyright
User license
Creative Commons Attribution (CC BY 4.0). Permitted:
 Read, print & download
 Redistribute or republish the final article
 Text & data mine
 Translate the article
 Reuse portions or extracts from the article in other works
 Sell or reuse for commercial purposes
Elsevier's open access license policy