External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb

Introduction: Sample size ‘rules-of-thumb’ for external validation of clinical prediction models suggest at least 100 events and 100 non-events. Such blanket guidance is imprecise, and not specific to the model or validation setting. We investigate factors affecting precision of model performance estimates upon external validation, and propose a more tailored sample size approach. Methods: Simulation of logistic regression prediction models to investigate factors associated with precision of performance estimates. Then, explanation and illustration of a simulation-based approach to calculate the minimum sample size required to precisely estimate a model’s calibration, discrimination and clinical utility. Results: Precision is affected by the model’s linear predictor (LP) distribution, in addition to number of events and total sample size. Sample sizes of 100 (or even 200) events and non-events can give imprecise estimates, especially for calibration. The simulation-based calculation accounts for the LP distribution and (mis)calibration in the validation sample. Application identifies 2430 required participants (531 events) for external validation of a deep vein thrombosis diagnostic model. Conclusion: Where researchers can anticipate the distribution of the model’s LP (e.g. based on development sample, or a pilot study), a simulation-based approach for calculating sample size for external validation offers more flexibility and reliability than rules-of-thumb.


1 Introduction
Clinical prediction models utilise multiple variables (predictors) in combination to predict an individual patient's risk of a clinical outcome [1][2][3]. An important part of prediction model research is assessing the predictive performance of a model, in terms of whether the model's predicted risks: (i) discriminate between individuals that have the outcome and those that do not, and (ii) calibrate closely with observed risks (i.e. predicted risks are accurate). This can be done by internal validation (such as bootstrapping) using the development data, and by external validation using independent data (i.e. data different to that used for model development). Examining clinical utility (e.g. a model's net benefit) is also important if the model is to be used to change (e.g. treatment) strategies in clinical practice when predicted risks are above a particular threshold [4][5][6].
In contrast to model development studies [7][8][9][10], relatively little research has been published on the sample size needed to externally validate a prediction model. For a binary outcome, the number of events is often used as the effective sample size [2], and therefore larger sample sizes are needed in settings where the outcome is rare. Steyerberg suggests having at least 100 events and 100 non-events for statistical tests to have 'reasonable power' in an external sample, but preferably >250 events and >250 non-events to have power to detect small but still important invalidity [11]. Other simulation and resampling studies conducted by Vergouwe et al. [12], Collins et al. [13], and van Calster et al. [14] also suggest having at least 100 events and 100 non-events to ensure accurate and precise estimates of performance measures, and even larger sample sizes (a minimum of 200 events and 200 non-events) to derive flexible calibration curves [13,14].
In this article, we evaluate whether the rule-of-thumb of having at least 100 (or 200) events and non-events is adequate for external validation of a prediction model with a binary outcome. A simulation study is used to investigate the relationship between various factors and precision of performance measures. Based on this, we suggest that sample size needs to be tailored to the setting of interest and propose a more flexible simulation-based approach to do this. Section 2 introduces predictive performance measures and describes the methods used for the simulation study and our simulation-based sample size calculation. Section 3 gives the results and the sample size approach is illustrated for validation of a prediction model for deep vein thrombosis (DVT). Finally, Section 4 provides some discussion.

Predictive performance measures and a motivating example
Consider a prediction model, developed using logistic regression for a binary outcome, that is to be externally validated. It will take the form

\[ \ln\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots \qquad (1) \]

where p_i is the predicted probability of the outcome for individual i, α is the intercept, and the X and β terms represent the observed predictor values and predictor effects (log odds ratios) respectively. The right-hand side of the equation is often referred to as the linear predictor (LP). The predictive performance of a model is usually evaluated by estimating measures of calibration, discrimination and clinical utility, as defined in Box 1. A calibration plot is also essential to visually demonstrate the range of predicted risks, and their calibration with observed risks, ideally using a flexible (e.g. loess smoothed) calibration curve [14,15]. The integrated calibration index (ICI) can be calculated to quantify the difference between the smoothed calibration curve and the ideal 45 degree line [16]. A similar measure is the estimated calibration index (ECI) [14].

Discrimination
 Discrimination is assessed through the C-statistic, which for a binary outcome is equivalent to the area under the receiver operating characteristic curve. Values typically range from 0.5 for a model that discriminates no better than chance alone, through to 1 which would represent perfect discrimination.
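As an illustration of the definition above, the C-statistic can be computed directly as the proportion of event/non-event pairs in which the event has the higher LP value, with ties counted as one half. This is a minimal sketch in plain Python; the function name `c_statistic` is our own choice for illustration and is not taken from the study's Stata or R code.

```python
def c_statistic(lp, y):
    """Proportion of (event, non-event) pairs ranked correctly by the LP.

    lp: linear predictor (or predicted risk) per participant
    y:  observed binary outcome per participant (1 = event, 0 = non-event)
    Ties between an event and a non-event count as half a concordant pair.
    """
    events = [x for x, outcome in zip(lp, y) if outcome == 1]
    non_events = [x for x, outcome in zip(lp, y) if outcome == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Perfect separation gives 1; identical predictions for everyone give 0.5.
print(c_statistic([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1]))  # 1.0
print(c_statistic([1.0, 1.0, 1.0, 1.0], [0, 1, 0, 1]))  # 0.5
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; a rank-based calculation would be preferable for large validation datasets.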

Net benefit
The overall consequences of using a prediction model for clinical decisions can be measured using the net benefit [4,6], which expresses the relative value of benefits and harms associated with using the model to determine clinical decisions. Net benefit (NB) is

\[ \mathrm{NB} = \text{sensitivity} \times \phi - (1 - \text{specificity}) \times (1 - \phi) \times \frac{p_t}{1 - p_t} \]

where φ is the outcome proportion, p_t is the chosen risk threshold value for which clinical decisions are deemed necessary, and the sensitivity and specificity of the model predictions depend on p_t.
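A minimal sketch of how net benefit at a single risk threshold might be computed from predicted risks and observed outcomes is given below. It uses the standard decision-curve form (true positives minus false positives weighted by the odds of the threshold, per participant), which is algebraically the same quantity expressed through sensitivity and specificity. The function name `net_benefit` and the convention that a predicted risk at or above the threshold counts as test-positive are assumptions made for illustration.

```python
def net_benefit(pred_risk, y, threshold):
    """Net benefit of acting on predicted risks at or above `threshold`.

    NB = TP/N - FP/N * threshold / (1 - threshold)
    """
    n = len(y)
    true_pos = sum(1 for p, outcome in zip(pred_risk, y)
                   if p >= threshold and outcome == 1)
    false_pos = sum(1 for p, outcome in zip(pred_risk, y)
                    if p >= threshold and outcome == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit weight
    return true_pos / n - (false_pos / n) * odds


# Four participants, threshold 0.1: two true positives and one false positive.
risks = [0.05, 0.20, 0.30, 0.60]
outcomes = [0, 0, 1, 1]
print(net_benefit(risks, outcomes, threshold=0.1))
```

In this toy example sensitivity is 1, specificity is 0.5 and the outcome proportion is 0.5, so the result equals 0.5 - 0.25 × (0.1/0.9).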

Simulation study to investigate factors that influence the precision of performance estimates
We hypothesised that four factors relating to the external validation sample could affect the precision of performance estimates: (i) the outcome proportion, (ii) the total sample size, (iii) the standard deviation of the LP values, and (iv) the true (mis)calibration of the model. We conducted a simulation study to investigate this, as now described.

Scenarios for the simulation study
We assumed the prediction model (which is to be externally validated) has an LP that is normally distributed: LP_i ~ Normal(μ, σ²). Scenarios for the simulations were defined using different values of σ (standard deviation of the LP) and μ (mean of the LP), as given in Table 1. The value of μ was selected to correspond to a particular 'base probability', p = inverse logit(μ) = 1/(1+exp(-μ)). This is the outcome event probability for an individual who has the mean LP value; alternatively, it can be considered the expected probability of an event in a population where σ = 0 and so LP is μ for all participants. When σ = 0, the base probability would be equal to the incidence (for prognostic studies) or prevalence (for diagnostic studies).
We selected base probabilities to cover a wide range, from rare outcomes (~0.05) to common outcomes (0.5). Values for σ were chosen to provide a narrow through to a wide range of predicted probabilities from the model (depending on the outcome event proportion), as shown in Figure 1. This also reflects low through to high values of the C-statistic, as the C-statistic will increase with wider distributions of LP. C-statistic values covered by the scenarios ranged from 0.55 when σ=0.2 to 0.75 when σ=1.0 and base probability=0.05.
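The distinction between the base probability and the overall outcome proportion can be made concrete with a short calculation. The sketch below, which is our own illustration rather than the study code, averages inverse logit(LP) over LP ~ Normal(μ, σ²) using a simple trapezoid rule; the function names and the grid settings are assumptions chosen for clarity.

```python
import math


def logit(p):
    """Log-odds of a probability; used to turn a base probability into mu."""
    return math.log(p / (1.0 - p))


def expected_outcome_proportion(mu, sigma, grid=2001):
    """Mean of inverse-logit(LP) when LP ~ Normal(mu, sigma^2).

    Uses the trapezoid rule on mu +/- 8 sigma (an illustrative choice).
    """
    if sigma == 0:
        return 1.0 / (1.0 + math.exp(-mu))
    lower = mu - 8.0 * sigma
    step = 16.0 * sigma / (grid - 1)
    total = 0.0
    for k in range(grid):
        x = lower + k * step
        density = (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                   / (sigma * math.sqrt(2.0 * math.pi)))
        weight = 0.5 if k in (0, grid - 1) else 1.0
        total += weight * density / (1.0 + math.exp(-x))
    return total * step


for base_prob in (0.05, 0.5):
    for sigma in (0.2, 1.0):
        prop = expected_outcome_proportion(logit(base_prob), sigma)
        print(f"base probability {base_prob}, sigma {sigma}: "
              f"outcome proportion {prop:.3f}")
```

For a rare outcome the outcome proportion rises above the base probability as σ increases, whereas for a base probability of 0.5 it stays at 0.5 by symmetry. Applied to the Normal(-1.75, 1.47²) distribution used later in the DVT example, the same function gives a value close to the outcome proportion of 0.22 quoted there.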

2.2.2 Main simulation process
The steps for the main simulation were as follows:
1) Define the simulation scenario by specifying σ and μ, with the latter corresponding to the 'base probability' of the outcome, p = 1/(1+exp(-μ)). Also specify the desired expected number of events (E) in a population where all individuals have the base probability.
2) Set the validation dataset's sample size (N) using E divided by the base probability.
3) Generate LP values for each patient in the dataset using LP_i ~ Normal(μ, σ²).

We also examined whether the precision was adequate when the sample size met the rule-of-thumb of 100 (or 200) events.
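Steps 1 to 3 can be sketched in a few lines of Python. The later steps are not listed in this section, so the outcome-generation line below is an assumption: it draws each outcome from a Bernoulli distribution with probability inverse logit(LP), which corresponds to a perfectly calibrated model. The function name, argument names and seed are hypothetical.

```python
import math
import random


def simulate_validation_dataset(base_prob, sigma, expected_events, rng):
    """Generate one simulated external validation dataset (LP and outcome)."""
    # Step 1: mean of the LP implied by the chosen base probability.
    mu = math.log(base_prob / (1.0 - base_prob))
    # Step 2: sample size so that E events are expected at the base probability.
    n = round(expected_events / base_prob)
    # Step 3: LP values drawn from Normal(mu, sigma^2).
    lp = [rng.gauss(mu, sigma) for _ in range(n)]
    # Assumed next step (not given in the text above): outcomes generated
    # from the model's own predicted risks, i.e. perfect calibration.
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0 for x in lp]
    return lp, y


rng = random.Random(2021)  # arbitrary seed for reproducibility
lp, y = simulate_validation_dataset(base_prob=0.05, sigma=1.0,
                                    expected_events=100, rng=rng)
print("N =", len(lp), "observed events =", sum(y))
```

Because the inverse logit is non-linear, the number of events actually generated will generally differ from E when σ > 0; E is only the expected count in a population where everyone has the base probability.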

Extensions to miscalibration
Step 4 assumes that the prediction model's LP is correct, such that the true calibration model is perfect (i.e. intercept and slope are 0 and 1, respectively, in Equation 2). Therefore, the scenarios were also extended to assess the effect of miscalibration. To do so, steps 1-3 remained the same but then we also created LP_miscal, in which the original LP was multiplied

Proposal for simulation-based sample size calculations
Rather than using a rule-of-thumb, we propose a simulation-based approach to identify the sample size required to achieve precise performance estimates. The proposal follows similar steps to those described previously for our simulation study, except now the process is iterative and converges when the minimum sample size is achieved. It is summarised in Box 2, and requires the researcher to specify the desired precision for each performance measure of interest (calibration, discrimination, net benefit, etc), the model's anticipated LP distribution in the validation population, and whether or not the model is well calibrated (i.e. the values of parameters γ and S of the calibration model in Equation 2).
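Equation 2 is referred to here but is not reproduced in this section. Assuming it takes the usual logistic recalibration form, with γ acting as the calibration intercept and S as the calibration slope applied to the model's LP, it can be written as:

```latex
\operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \gamma + S \times \mathrm{LP}_i
```

Under this assumed form, γ = 0 and S = 1 mean that the true outcome probabilities equal the model's predicted risks, whereas S < 1 describes predictions that are too extreme, as typically arises from overfitting.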
A sensible starting point is to assume the model is well calibrated (i.e. γ = 0 and S = 1) and that the LP distribution is the same as that for the development study, especially if the validation population is similar to the development population. The LP distribution may be obtained directly from the development study's publication or authors; if unavailable, it can be calculated indirectly using other information, such as the reported C-statistic or the distribution for each outcome group (e.g. displayed at the bottom of a calibration plot) [17][18][19]. If the validation population is considered different from the development population (e.g. due to change in expected outcome proportion and/or case-mix), a pilot study may be necessary to gauge the distribution better. Further advice is given in the Supplementary Material.

Our applied example is the diagnostic prediction model for DVT reported by Debray et al. [20]. The model contained eight predictors, and overfitting was not a major concern given a large number of events per predictor. The model's linear predictor distribution was reported for their development cohort and also other settings, and we use this to illustrate our simulation-based approach for calculating the sample size for an external validation study of the DVT model. Example code is given in the supplementary material for Stata and is available on GitHub for R (https://github.com/gscollins1973/External-validation-sample-size).

3) Specify the target precision for each performance measure.

4) Specify a starting sample size of the validation study
For example, starting with N=100.

5) Generate LP and true outcome values for each participant.
Randomly generate the LP value for each participant using the distribution in step 1. Then, calculate the logit(p_i) value for each participant using the calibration model specified in step 2.

6) Calculate performance measures of interest for the prediction model in the external validation dataset: store estimates and 95% CIs.
For example, by comparing the model's predicted LP value and the true outcome value for all participants in the dataset, estimate the C-statistic, calibration slope, calibration-in-the-large, E/O statistic, net benefit (at particular risk thresholds).

7) Repeat steps 5 and 6 for a specified number of repetitions.
For example, 500 repetitions.

8) Use the stored estimates to calculate estimates of precision for each performance measure.
For example, the mean 95% CI width across the repetitions can be stored as the estimate of precision.
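The steps above can be sketched as a short Python program. This is an illustration under stated assumptions, not the authors' Stata or R implementation: it targets a single performance measure (the calibration slope), uses a Wald 95% CI from a two-parameter logistic regression fitted by Newton-Raphson, assumes outcomes are Bernoulli draws from the calibration model logit(p) = γ + S × LP, and increases N in fixed increments until the mean CI width meets the target. The function names and the `start`, `step`, `reps` and `seed` arguments are hypothetical, and `reps` is kept far below the 500 repetitions suggested above purely so the example runs quickly.

```python
import math
import random


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def slope_ci_width(lp, y, max_iter=25):
    """Fit logit(p) = a + b*LP by Newton-Raphson; return Wald 95% CI width for b."""
    a, b = 0.0, 1.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, outcome in zip(lp, y):
            p = expit(a + b * x)
            w = p * (1.0 - p)
            g0 += outcome - p          # score for the intercept
            g1 += (outcome - p) * x    # score for the slope
            h00 += w                   # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-8:
            break
    var_b = h00 / det  # slope element of the inverse information matrix
    return 2.0 * 1.96 * math.sqrt(var_b)


def mean_slope_ci_width(n, mu, sigma, gamma, slope, reps, rng):
    """Steps 5 to 8: simulate `reps` datasets of size n and average the CI width."""
    widths = []
    for _ in range(reps):
        lp = [rng.gauss(mu, sigma) for _ in range(n)]
        y = [1 if rng.random() < expit(gamma + slope * x) else 0 for x in lp]
        widths.append(slope_ci_width(lp, y))
    return sum(widths) / reps


def minimum_sample_size(target_width, mu, sigma, gamma=0.0, slope=1.0,
                        start=100, step=50, reps=50, seed=1):
    """Increase N from `start` in steps until the mean CI width meets the target."""
    rng = random.Random(seed)
    n = start
    while mean_slope_ci_width(n, mu, sigma, gamma, slope, reps, rng) > target_width:
        n += step
    return n


# LP distribution reported for the DVT development cohort; the deliberately
# loose target width of 0.5 keeps the run time short for this illustration.
n_required = minimum_sample_size(target_width=0.5, mu=-1.75, sigma=1.47)
print("Approximate N for a mean calibration slope CI width of 0.5:", n_required)
```

A fuller version would repeat the search for each performance measure of interest and take the largest N, and would use a smaller step size and more repetitions near the final answer.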

Factors associated with the precision of model performance estimates: results from simulation study

3.1.1 Precision of the estimated C-statistic
The simulation scenarios (Table 1, Figure 1) represented models with C-statistics from 0.56 (when σ=0.2) through to 0.75 (when σ=1.0). Figure 2 (Panels A & B) shows that estimates of the C-statistic were more precise (based on the average 95% CI width) when the outcome was rare compared to a more common outcome, for a particular average number of events. This is likely because the total sample size needs to be much larger for a rare outcome to achieve the same number of expected events compared to a more common outcome. The standard deviation (σ) of the LP also affected the precision of the C-statistic (Figure 2). The estimates were more precise when σ was larger, although the difference in the width of the 95% CI for the C-statistic when σ=1 compared to when σ=0.2 was only between 0.02 and 0.06 (depending on base probability) even for studies with 50 expected events. As precision increased with increasing SD(LP) for the scenarios considered, we would expect even larger C-statistics (e.g. >0.8) than considered here to be estimated even more precisely.
If an outcome was common (base probability=0.5) and σ=1.0, the average 95% CI widths were 0.14 and 0.09 with 100 and 200 expected outcome events and non-events (N=200 and N=400), respectively (Figure 2, Panel B or D). Therefore, with 200 events, a typical 95% CI would range from about 0.69 to 0.78. If we wanted a more precise estimate, say with a 95% CI width of 0.05, we would need at least 700 events (N=1400).

Precision of the estimated calibration slope
Estimates of the calibration slope can be very imprecise when the number of events is low. For example, in Figure 3, across all scenarios the average 95% CI width is greater than 0.5 when there are around 50 outcome events, but can still be wide for studies with 100 or even 200 events when the outcome event proportion is high or σ is small. As seen in Figure 3 (panels C & D), when σ=0.2 the average width of the 95% CI for the calibration slope is > 1 even for large studies with approximately 500 events on average. Although not as dramatic, estimates also become less precise as the base probability (and therefore the outcome event proportion) moves towards 0.5 (Figure 3, panels A & B). Again, this is likely to be related to the difference in total sample size required to achieve the same number of events when the base probability differs.
If we wanted a very precise estimate, say an average 95% CI width of 0.2, we would need at least 400 outcome events if the outcome was rare (base probability=0.05) and the spread of the LP was large (σ=1.0). If we aimed for a 95% CI width of 0.4 (e.g. 95% CI: 0.8 to 1.2), this would be achievable with 100 outcome events when the outcome was rare (base probability=0.05) and σ=1.0, but would require more than 300 outcome events if the outcome was more common (base probability>0.4 with σ=1.0) or if the distribution of the LP was narrower (σ<1.0 with base probability=0.05).

Precision of the estimated calibration-in-the-large and O/E statistic
The 95% CIs for calibration-in-the-large were wide for low numbers of events, which indicates that in many circumstances 100 events is unlikely to be enough to obtain precise estimates (e.g. a 95% CI width > 0.4). The standard deviation of the LP did not affect the precision much (Figure 4, panels C & D). However, differences were seen for different base probabilities (Figure 4, panels A & B). Findings for the ratio between observed and expected outcomes (O/E) were similar to those observed for calibration-in-the-large (Supplementary Figure S1).

Extensions to scenarios with miscalibration
For the scenarios with miscalibration, each model was evaluated in different datasets (in which the model would be miscalibrated by varying degrees, as specified in Section 2.2.3).
The precision in performance estimates was not greatly affected by miscalibration when the average number of observed events was still similar to that expected upon validation.
However, performance estimates were less precise when miscalibration resulted in fewer events observed than expected. Supplementary Table S2 gives an example.

Application of simulation-based sample size calculation to go beyond current rules-of-thumb
The simulation study confirms that the precision of estimates of a model's predictive performance is affected by the standard deviation of the LP (σ), the outcome proportion (overall outcome risk), the number of events, and the total sample size. Adhering to blanket rules-of-thumb (e.g. using 100 events) ignores these intricacies and fails to give precise performance estimates in some settings.
In contrast, our simulation-based approach to sample size calculation can be tailored to the model and population at hand (Section 2.3). That is, if researchers can specify the likely distribution of the model's LP and the outcome proportion in the target population, they can then use the simulation-based approach to identify a suitable sample size to ensure predictive performance estimates are precise.
To illustrate this, consider external validation of Debray's diagnostic prediction model for DVT (introduced in Section 2.3.1) [20], and the required sample size if:
a. the model is validated in the same population as the development cohort, and the model is expected to be well calibrated (γ = 0 and S = 1 in Equation 2).
b. the model is validated in the same population as the development cohort, but the model is expected to be miscalibrated (e.g. due to overfitting) (γ = 0 and S = 0.9 in Equation 2).
c. the outcome event proportion differs from the development data, either due to different case-mix or miscalibration of the model.
We consider these in turn, and compare to the rule-of-thumb of 100 or 200 events.

Validation in the same population with good calibration
Debray et al. reported that in the development cohort the model's LP followed an approximate Normal(-1.75, 1.47²) distribution [20]. Assuming the external validation study has the same distribution, and that the model is well calibrated (γ = 0 and S = 1 in Equation 2), we conducted simulations of external validation studies that have an average of 100 or 200 events. Table 2 shows the mean of the 95% CI widths for a range of calibration, discrimination and clinical utility measures. The 95% CI is fairly narrow for the C-statistic even when there are 100 events (mean width 0.09); it is also narrow for the integrated calibration index and net benefit (at an arbitrary clinical risk threshold of 0.1 for illustration). However, calibration-in-the-large and calibration slope estimates are imprecise with 100 events (e.g. mean CI width 0.46 for slope), and even with 200 events (e.g. mean CI width 0.33 for slope).
Using the simulation-based process described in Box 2, we calculated the minimum sample sizes needed to obtain average 95% CI widths of 0.1, 0.2, and 0.2 for the C-statistic, calibration slope, and ln(O/E), respectively (Table 3). This corresponds to an expected 95% CI of about 0.77 to 0.87 for the C-statistic, 0.9 to 1.1 for the calibration slope, and 0.9 to 1.1 for O/E, which we deemed precise enough for making strong inferences. We focus on the precision of O/E rather than calibration-in-the-large as it is easier to interpret. The results suggest that a sample size of 2430 participants (531 outcome events) is required, which is driven by the sample size required to estimate the calibration slope precisely. Clearly, if calibration is considered less relevant than, say, net benefit, then a lower number may be sufficient for this particular model, given the narrow 95% CI width for net benefit even with 100 events (Table 2). However, calibration is an under-appreciated measure, and is indeed linked to net benefit [5], so we recommend that it is nearly always assessed.
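The link between a target CI width of 0.2 on the ln(O/E) scale and an interval of roughly 0.9 to 1.1 for O/E itself is simple arithmetic, shown in the short sketch below. It assumes a symmetric interval on the log scale centred on a point estimate of O/E = 1 (i.e. a well calibrated model); the function name is our own.

```python
import math


def ratio_ci_from_log_width(point_estimate, log_scale_width):
    """Back-transform a symmetric CI on the log scale to the ratio scale."""
    half_width = log_scale_width / 2.0
    return (point_estimate * math.exp(-half_width),
            point_estimate * math.exp(half_width))


lower, upper = ratio_ci_from_log_width(point_estimate=1.0, log_scale_width=0.2)
print(round(lower, 2), round(upper, 2))  # 0.9 1.11
```

The back-transformed interval is slightly asymmetric around 1, which is why the bounds are described as approximately 0.9 to 1.1.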

Validation in the same population but assuming miscalibration
Now we assume that in the validation population the model has the same LP distribution as in the development sample (Normal(-1.75, 1.47²)), but that the true calibration slope (S in Equation 2) is 0.9 (e.g. due to slight overfitting that was unaccounted for during model development) and γ is a non-zero value that ensures the outcome proportion is still 0.22 in the population. Aiming for the same CI widths as in the previous example, our simulation-based calculation now identifies that the required sample size is 2141 participants (471 outcome events), again driven by the calibration slope. When the true calibration slope is assumed to be 0.8, the required sample size is lower still (1900 participants, 416 outcome events). Hence, the larger the miscalibration assumed, the lower the required sample size.

Validation in a different population with a different case-mix or event proportion
Lastly, consider a very different population from the development dataset, as shown by Debray et al. [20], where the outcome proportion is lower at 0.13 and the prediction model's LP distribution has changed (Normal(-2.67, 1.56²)) due to a different case-mix. Assuming the model is well calibrated in terms of the slope (S = 1 in Equation 2), but setting the γ parameter to a non-zero value so that the outcome event proportion of 0.13 is achieved in the population, our simulation-based approach identifies that 3156 participants (and 400 outcome events) are required, again driven by ensuring precise estimation of the calibration slope. This is substantially more than the rule-of-thumb of 100 or 200 outcome events.

4 Discussion
Sample size for external validation studies should ensure precise estimates of the performance measures of interest (e.g. calibration, discrimination, clinical utility).
Our simulation study shows that rules-of-thumb such as requiring a minimum of 100 events and 100 non-events (or even 200) do not give precise estimates in all scenarios, especially where calibration is of interest. Further, the precision of the C-statistic, calibration slope and calibration-in-the-large depends not only on the number of expected events, but also on the event proportion and therefore the overall sample size, as well as the distribution of the LP.
Our proposed simulation-based approach accounts for these aspects, and is thus more flexible and reliable. Our examples illustrate how it calculates the required sample size for the particular model and validation setting of interest, and allows situations assuming calibration or miscalibration to be examined.
The sample sizes based on precision of performance statistics are generally larger than the rules-of-thumb, especially where calibration is of interest, and in particular where the calibration slope is to be estimated precisely, as demonstrated in our applied example (where 531 outcome events were deemed necessary) and the simulation study (e.g. see Figure 3 and Section 3.2.2). This contrasts with work by others which showed that fewer than 100 events were required in some cases for validation of scoring systems based on logistic regression [21]. However, their calculations were based on achieving smooth calibration plots rather than ensuring precise estimates of numerical measures of calibration. Applied examples also show imprecise estimates even when there are more than 100 events. For example, external validation of a prediction model for adverse outcomes in pre-eclampsia used a dataset with 185 events, and yet the 95% CIs for the C-statistic (0.64 to 0.86) and the calibration slope (0.48 to 1.32) were wide [22].
Our proposal to base sample size on precision of performance estimates is in line with Jinks et al., who suggest precisely estimating Royston's D statistic for survival prediction models [23]. Our simulation approach is more generalizable, as it can assess multiple performance measures simultaneously, and can be adapted for any outcome data type (e.g. continuous, binary or survival). For survival data, simulations would also need to specify the censoring mechanism and key time-points of interest.
We focused on precise estimates of calibration, discrimination and clinical utility. Although the researcher should define the measures of key interest, we generally recommend that all are considered. Calibration and clinical utility, in particular, are often under-appreciated [24][25][26]. Ensuring precise estimates of calibration in terms of O/E (or the calibration-in-the-large) and calibration slope will help construct a reliable calibration plot. However, precise estimates across the entire range of predictions (e.g. within each tenth of predicted risk from 0 to 1) would likely require even larger sample sizes. The simulation-based approach could also be extended to determine the sample size required to directly compare models, but again larger sample sizes are likely to be needed. If an external dataset is already available (i.e. sample size is fixed), the approach can be used to ascertain the expected precision for that particular sample size and observed linear predictor distribution (to help justify its suitability).
We assumed that the linear predictor is normally distributed, which is supported by empirical evidence in some areas [27,28]. However, the simulation-based sample size approach (Box 2) can easily be adapted to use other distributions for the LP, as appropriate. If the prediction model contains only binary or categorical predictors, a discrete distribution may be more appropriate, whereas for skewed or more flexible shapes, a beta or gamma distribution may be preferable. Advice for obtaining the LP distribution is given in Section 2.3 and the supplementary material.
We recognise that what is 'precise' is subjective. Our examples in Section 3.2 gave suggestions for the O/E, calibration slope and C-statistic based on particular 95% CI widths.
The simulation-based calculation identifies the sample size that is expected to give (i.e. on average) CIs of the desired width. An alternative is to identify the sample size that gives CIs that are no wider than the desired width on, say, 95% of simulations. This would be even more reassuring but requires even larger sample sizes.
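The two criteria can be contrasted with a small helper that summarises the CI widths stored across simulation repetitions. This is a sketch under our own assumptions: the function name is hypothetical, and the percentile is taken as the smallest stored width that at least the stated proportion of repetitions do not exceed.

```python
import math


def precision_summary(widths, coverage=0.95):
    """Return (mean width, width not exceeded in `coverage` of repetitions)."""
    ordered = sorted(widths)
    index = math.ceil(coverage * len(ordered)) - 1
    return sum(ordered) / len(ordered), ordered[index]


# Illustrative widths from 20 hypothetical repetitions: 0.01, 0.02, ..., 0.20.
example_widths = [k / 100 for k in range(1, 21)]
mean_width, p95_width = precision_summary(example_widths)
print(mean_width, p95_width)
```

A sample size would satisfy the average-based criterion when `mean_width` is at or below the target, and the stricter criterion when `p95_width` is at or below the target.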
In summary, we propose that precise performance estimates should be targeted when planning external validation studies, and a tailored sample size can be determined through simulation by specifying the likely distribution of the LP, the outcome event proportion and target precision for each performance measure. The sample size that, on average, gives the target precision for all performance measures should be selected for the external validation data.