Multivariate meta-analysis of individual participant data helped externally validate the performance and implementation of a prediction model

Objective: Our aim was to improve meta-analysis methods for summarising a prediction model’s performance when individual participant data are available from multiple studies for external validation. Study design & setting: We suggest multivariate meta-analysis for jointly synthesising calibration and discrimination performance, whilst accounting for their correlation. The approach estimates a prediction model's average performance, the heterogeneity in performance across populations, and the probability of ‘good’ performance in new populations. This allows different implementation strategies (e.g. recalibration) to be compared. Application is made to a diagnostic model for deep vein thrombosis (DVT) and a prognostic model for breast cancer mortality. Results: In both examples multivariate meta-analysis reveals that calibration performance is excellent on average , but highly heterogeneous across populations unless the model’s intercept (baseline hazard) is recalibrated. For the cancer model, the probability of 'good' performance (defined by C-statistic ≥ 0.7 and calibration slope between 0.9 and 1.1) in a new population was 0.67 with recalibration, but 0.22 without recalibration. For the DVT model, even with recalibration there was only a 0.02 probability of 'good' performance. Conclusion : Multivariate meta-analysis can be used to externally validate a prediction model’s calibration and discrimination performance across multiple populations, and to evaluate different implementation strategies.


M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
What is new?

KEY FINDINGS
Given individual participant data (IPD) from multiple external validation studies, metaanalysis enables researchers to summarise prediction model performance, in terms of both average performance and consistency in performance across populations.It thereby allows different implementation strategies (e.g.recalibration) to be formally compared.
A multivariate meta-analysis approach should be used to jointly evaluate discrimination and calibration performance, whilst accounting for their correlation.This can be used within internal-external cross-validation (to also incorporate a model development phase), or when IPD from multiple studies are available for external validation of existing models.

WHAT THIS ADDS TO WHAT IS KNOWN:
Before implementation, risk prediction models require validation in data external to that used for model development.This is best achieved using IPD from multiple studies, so that model performance can be examined and quantified across multiple populations of interest.A good prediction model will have satisfactory performance on average across all external validation datasets, and crucially little or no between-study heterogeneity in performance.
Our examples show that a prediction model may have excellent average performance, but with heterogeneity (inconsistency) in performance across populations.Recalibration of the model's intercept term (or baseline hazard) in the intended population might reduce heterogeneity, and thereby improve the probability of acceptable model performance when applied in new populations.

WHAT IS THE IMPLICATION, WHAT SHOULD CHANGE NOW:
When IPD are available from multiple studies for external validation of a prediction model, researchers should use multivariate meta-analysis to jointly summarise calibration and discrimination performance, and to identify how best to implement the model in new populations.

Introduction
A crucial part of medical research is to develop risk prediction models.These aim to accurately predict disease and outcome risk in individuals [1][2][3], thereby informing clinical diagnosis and prognosis.For example, healthy individuals with a high predicted risk of future disease (e.g.cardiovascular events) may be advised to modify their lifestyle and behaviour choices (e.g.smoking, exercise), and diseased individuals may be grouped (e.g.stage of cancer) according to future outcome risk so that clinical decisions (such as treatment options, monitoring strategies) can be tailored accordingly.Two well-known examples are QRISK [4] and the Nottingham Prognostic Index [5].They are typically implemented within a multivariable regression framework, such as logistic or Cox regression, which provides an equation to estimate an individual's risk based on values of multiple predictors (prognostic factors [6]) such as age, biomarkers, and genetic information A key stage of prediction model research is model development [2].This identifies important predictors and develops the risk prediction equation using an available dataset; it usually also examines the model's apparent performance in this same data, or uses internal validation techniques (such as bootstrap resampling) to examine and adjust for optimism in performance [7].The next stage is external validation [8][9][10].This uses data external to the model development data and its source, and examines whether the model predictions are accurate in another (but related) situation.The aim is to ascertain the model's generalisability to the intended populations for use [11], and to identify the best implementation strategy (e.g.recalibration of the intercept).
Unfortunately, most prediction research focuses on model development and there are relatively few external validation studies [12].However, nowadays there is increasing access to multiple datasets, as evident in meta-analyses using individual participant data (IPD) from multiple studies [13,14].This provides an exciting opportunity to perform external validation on multiple occasions [15,16].Model development and external validation can even occur simultaneously, using an approach called internal-external cross-validation [17,18].This develops a model in all but one of the IPD studies, and then its external validity is immediately checked in the omitted study.This process is repeated across all rotations of the omitted study, to measure external validity in each distinct IPD study.

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
Given multiple external validation studies, meta-analysis methods are needed to synthesise and summarise model performance appropriately across the available populations.Van Klaveren et al. [16], Pennells et al. [15] and, within internal-external cross-validation, Royston et al. [17] consider approaches to summarise validation performance across multiple studies or clusters.These focus mainly on producing pooled estimates of discrimination performance; that is, a model's ability to distinguish correctly between patients with and without the outcome of interest.Researchers should also be interested in summarising calibration performance, which is the agreement between a model's predicted risk and the observed risk.
Calibration is often ignored in external validation research [19], even though it is fundamental that observed and predicted risks should closely agree.Moreover, baseline risk may vary across study populations and so a model's implementation may need to be tailored to each population (often referred to as recalibration) in order to improve calibration performance in new populations.
In this article, we propose multivariate meta-analysis for jointly synthesising discrimination and calibration performance, whilst accounting for their correlation.This can be used within internal-external cross-validation (to also incorporate a model development phase), or when IPD from multiple studies are available for external validation of existing models.We show that the multivariate approach summarises a prediction model's average discrimination and calibration performance, and quantifies the heterogeneity in performance across populations.
It also allows researchers to predict the potential calibration and discrimination of a model when it is applied to a new population, and can be used to estimate the probability of 'good' performance (as pre-defined by the user).Using two real examples, we illustrate how this enables researchers to compare the performance of different implementation strategies (e.g.recalibration of the intercept term) to help identify the best strategy for applying the model in practice.
The article now proceeds by introducing the proposed multivariate meta-analysis methodology for summarising and comparing validation performance (Section 2).Two clinical examples are then used to illustrate the approach (Section 3), one for diagnosis and one for prognosis, and we conclude with some discussion (Section 4).

Meta-analysis of predictive performance statistics from multiple external validation studies
External validation of a prediction model requires evaluation of its predictive performance, in terms of both calibration and discrimination.There are many statistical measures available for this purpose [1,20].Here we focus on those most commonly used: the C-statistic [20,21] the D-statistic [22,23], the calibration slope [1,20], calibration-in-the-large [1], and the Expected/Observed number of events.These are defined in the Appendix.We focus here on how to meta-analyse such performance statistics when they are estimated in multiple external validation studies.

Obtaining suitable data for meta-analysis
The meta-analysis approach requires an estimate of each performance statistic of interest (e.g. C-statistic, calibration slope) from each external validation study.Given IPD, these can be calculated in each validation study using appropriate statistical methods, as described elsewhere [1,20,23].However, meta-analysis also requires the variance-covariance matrix of the performance statistics in each study: in other words, the variance of each performance estimate and (for multivariate meta-analysis) the correlation between all pairs of estimates.A general approach to obtaining these is via non-parametric bootstrapping, as described in the Appendix.

Univariate random-effects meta-analysis
For clarity, before proposing our multivariate approach we firstly describe a univariate random-effects meta-analysis that is applicable separately to each performance measure of interest [24,25].In external validation study i let Y ij be the estimate of the j th performance statistic of interest, and let S ij 2 be its sample variance (derived from bootstrapping and assumed known), then the univariate meta-analysis can be written as: Equation ( 1) assumes the Y ij are normally distributed about the i th study's true validation performance, μ ij , and that the μ ij are also normally distributed with an average of ߤ and a between-study standard deviation of ߬ .There are several frequentist methods that can be used for estimation of a random-effects meta-analysis; here we use restricted maximum likelihood (REML) [26].With the addition of prior distributions for unknown parameters, a Bayesian approach is also possible, for example using Gibbs sampling.An approximate 100(1-α)% confidence interval for the average performance, ߤ , is obtained by ߤ ±1.96 SE(ߤ), where SE(ߤ) is the standard error of ߤ. White [27] proposed that SE(ߤ) is inflated to account for the uncertainty in the estimated ߬ , and we implement this here.

Summarising consistency in model performance
On its own, ߤ is an incomplete summary because it does not adequately summarise the consistency in performance across studies.Estimates such as ‫ܫ‬ ଶ (the percentage of the total variation in study estimates that is due to between-study heterogeneity [28]) and ߬̂ ଶ are thus also helpful [29].However, when evaluating performance statistics of a risk prediction model we are examining its generalisability, in other words its robustness when applied in new populations that differ from those it was developed in [11].Thus, consistency is best expressed by a 100(1-ߙ)% prediction interval for the performance of the model in a new population [24,25].This is derived by where t ∝, N -2 is the 100 ቀ1-∝ 2 ቁ % percentile of the t-distribution for N -2 degrees of freedom (N = no. of studies), V(ߤ̂) =SE(ߤ̂) ଶ , and ∝ is typically taken to be 0.05 to give a 95% interval.The use of a t-distribution, rather than a normal distribution, is used to account for the uncertainty in ߬̂ ଶ [24].The prediction interval thus indicates the performance expected in a new (external validation) study, similar to those included in the meta-analysis.

Multivariate meta-analysis
Our multivariate approach is an extension of equation (1) [30], and allows the joint synthesis of all predictive performance measures of interest from the i = 1 to N external validation studies, whilst accounting for their within-and between-study correlation.Let there be j = 1 to J measures of interest, and let i Y be a vector containing the available J estimates ( ) of the measures in the i th validation study.The general multivariate metaanalysis model is as follows: Here MVN denotes a multivariate normal distribution, i θ contains the true underlying effects for the J performance measures for the i th study, i S is the within-study variance-covariance matrix for the i th study (assumed known) containing the J variances of the estimates (in the diagonal: ,S , ,S S K ) and their covariances (in the off-diagonal; for example is the within-study covariance for measures 1 and 2, where 1 is their within-study correlation caused by estimates derived from the same patients), µ contains the J means for the measures of interest, and Σ is the between-study variance-covariance matrix containing the J between-study variances (in the diagonal: ,τ , ,τ τ K ) and their between-study covariances (in the off-diagonal; e.g. the between-study covariance for measures 1 and 2 is , where is their between-study correlation induced by differences in study populations and settings).The number of rows in each vector is equal to the number of measures.In its simplest form with two measures of interest (e.g.C-statistic and calibration slope), equation (3) can be expressed as a bivariate meta-analysis (Appendix).REML can again be used for estimation, although other options are available [30,31].
Multivariate extensions to ‫ܫ‬ ଶ can also be calculated [26,31], giving the fraction of the total variability due to between-study variability for each performance statistic ‫ܫ(‬ ଶ ).

Making joint inferences across multiple performance measures
After equation ( 3) is estimated, marginal confidence and prediction intervals for each performance measure can be obtained using the formulae given in the univariate section.
However, by accounting for their correlation, the multivariate approach also enables joint inferences.For instance, extending equation (2) to a bivariate t-distribution with k -2 degrees of freedom, one can obtain a joint 95% prediction region for two performance measures of interest (e.g. the C-statistic and the calibration slope) in a new population.Joint probabilistic inferences can also be made if we assume the multivariate t-distribution is an approximate posterior distribution (that is, we assume it is obtained from a Bayesian analysis with uninformative priors, and give it means, variances and covariances obtained from REML  3) -see Supplementary material 1 for full details [32]).For example, one can derive the joint probability that the C-statistic will be above 0.7 and the calibration slope will be between 0.9 and 1.1 in a new population.A fully Bayesian approach can also be used to derive such posterior inferences by formally specifying prior distributions and combining them with the likelihood, then using, for example, Gibbs sampling to take samples from the exact posterior distributions.Riley et al. [33] describe the Bayesian approach to multivariate meta-analysis with IPD.

Comparing the predictive performance of different implementation strategies
When applying a prediction model to a new population, different implementation strategies might be used regarding the choice of model intercept (baseline hazard) (this is illustrated in Sections 3.1 and 3.2, and for example includes recalibration).Meta-analysis of performance statistics allows such implementation strategies to be formally compared.The aim is to identify an implementation strategy that, for each performance measure, has excellent performance on average (indicated by ߤ̂); small values of between-study heterogeneity (indicated by ߬ and/or ‫ܫ‬ ଶ ); and a narrow prediction interval that suggests consistently good performance in new populations.Multivariate meta-analysis even allows the competing strategies to be ranked according to their overall performance: for example, according to the joint probability that, in a new population, the C-statistic will be above 0.7 and the calibration slope will be between 0.9 and 1.1.The strategy with the largest probability will be ranked first.

Meta-regression and examining covariates
Meta-analysis equation ( 3) can be extended to a multivariate meta-regression that includes study-level covariates to explain between-study heterogeneity, such as treatment policies, population characteristics (e.g.mean age), year of investigation, and length of follow-up.
Competing implementation strategies can then be evaluated and compared for specific subgroups of studies (e.g.those done within the last few years, those with consistent treatment policies, those with the same case-mix, etc).This may help identify populations where model performance is satisfactory and others where it is inadequate, to inform the model's generalisability and applicability [11].A nice example of a meta-regression to examine the impact of case-mix variation on model performance is given by Pennells et al. [15], who identify that studies with a higher standard deviation of age are strongly associated 11 with a higher C-statistic and D-statistic.Model performance can also be examined for patientlevel covariates; for example, discrimination and calibration could be estimated for males and females separately.Equation (3) can then be applied to summarise each subgroup, or even the difference between subgroups.

Applied examples
We now illustrate the proposed meta-analysis methods with two applied prediction model examples, one for diagnosis and one for prognosis, and compare the performance of different implementation strategies, including recalibration

Data, model development and competing implementation strategies
We used IPD from 12 studies to develop a diagnostic prediction model for the risk of having deep vein thrombosis (DVT) in patients that were suspected of having DVT, as described previously [34].A total of 10002 patients were available across the 12 studies (with study sample sizes ranging from 153 to 1768 patients), and 1864 (19%) patients truly had DVT.
This IPD is used here only for illustration purposes and not to develop or recommend the optimal diagnostic model to be used in medical practice.
The prediction model was developed using logistic regression, including a separate intercept for each study and three predictors chosen a priori: sex (male=1, female=0), surgery (recent surgery or bedridden = 1, no recent surgery or bedridden = 0) and calf difference (≥3cm = 1, <3cm = 0).However, three different implementation strategies were considered (for the model intercept) when applying the developed model to the external validation dataset: This is a form of model recalibration [35].

M A N U S C R I P T A C C E P T E D
Internal-external cross-validation was undertaken for each implementation strategy, and their predictive performance then summarised and compared across the 12 external validation studies using our multivariate meta-analysis approach.

Results
Regardless of which study was excluded, the predictor effect estimates (log odds ratios) were very similar in each cycle of the internal-external cross-validation approach (supplementary material 2 shows the parameter estimates in each cycle, and the intercept to be implemented in strategy ( 3 There is a perfect negative correlation between log(expected/observed) and calibration-inthe-large by definition.
The multivariate meta-analysis results for each statistic are shown in Table 1.The metaanalysis results for the C-statistic are practically the same in all implementation strategies, as are those for the calibration slope.The mean C-statistic is 0.69 (95% CI: 0.67 to 0.71), indicating moderate discrimination.There is a small amount of between-study heterogeneity (τ≈0.02;I 2 ≈37%), leading to a 95% prediction interval of 0.64 to 0.73, revealing fairly consistent discrimination performance across studies (Figure 1).The mean calibration slope is around 0.98 (95% CI: 0.85 to 1.10), which is close to the ideal value of one though indicating very slight over-prediction.The amount of between-study heterogeneity is large (߬≈0.16;I 2 ≈59%), leading to a wide 95% prediction interval (e.g.0.59 to 1.38 for strategy (2)).This contains values well above and well below one, which respectively suggest that in 13 some populations the predicted probabilities vary too little (i.e. the model is under-fitted and/or assigns probabilities that are too similar across individuals) and in others they vary too much (i.e. the model is over-fitted to the development sample and assigns probabilities that vary too much across individuals).This illustrates how the average performance is an incomplete picture; calibration slope is good on average, but could be poor in particular populations (Figure 2).
Calibration-in-the-large does differ more importantly across implementation strategies (Table 1), as it is sensitive to the choice of intercept.The meta-analysis results reveal it is, on average, slightly worse for strategy (1) as there is a small over-prediction in the proportion with DVT (-0.13, 95% CI: -0.19 to -0.08).However, there is almost no heterogeneity in the calibration-in-the-large (τ=0.008;I 2 =1%), leading to a narrow 95% prediction interval (-0.20 to -0.07).Using strategy (2) or (3) the average calibration-in-the-large is closer to zero (-0.004 and 0.047 respectively), but comes at the expense of slightly larger between-study heterogeneity (τ=0.53 and 0.27, I 2 =97% and 89% respectively), leading to wider prediction intervals.For example, for strategy (2) the 95% prediction interval is -1.24 to 1.23.
Instead of calibration-in-the-large, it is perhaps easier to interpret the Expected/Observed proportion of DVT cases (Table 1).This follows a similar pattern (Table 1), with narrowest prediction interval for strategy (1) and slightly improved average performance for strategies (2) and (3).The 95% prediction interval for Expected/Observed for strategy (1) suggests the overall agreement is likely to be reasonable in new populations (1.05 to 1.14), with the number of DVT cases over-predicted by between 5% and 14%.However, the 95% prediction interval is unsatisfactory for the other strategies; for example, it is 0.41 to 2.54 for strategy (2) indicating the number of predicted DVT cases in a new population could range from 59% too few up to 154% too many.
Overall, therefore, strategy (1) appears best as it removes heterogeneity in the calibration-inthe-large and Expected/Observed, whilst maintaining similar discrimination.However, the prediction model would benefit from additional predictors, as currently discrimination is only moderate and there is large heterogeneity in calibration slope.This is confirmed by a joint probability of only 0.03 that strategy (1) will give a C-statistic ≥ 0.7 and a calibration slope between 0.9 and 1.1 in a new population (Table 2).If the criteria for model discrimination is relaxed to a C-statistic ≥ 0.65, then the joint probability improves but only to 0.43.

M A N U S C R I P T A C C E P T E D
ACCEPTED MANUSCRIPT 14

Data, model development, and competing implementation strategies
We used IPD from eight cohort studies (relating to eight different countries from Look et al. [36]) to develop and evaluate a prognostic prediction model for the risk of mortality over time in women recently diagnosed with breast cancer.In total there were 7435 patients (ranging from 69 patients to 3242 per study) and 2043 events.The maximum follow-up duration was 120 months and the median follow-up duration across all studies was 86.3 months.Internal-external cross-validation was used and, in each cycle, a Royston-Parmar flexible parametric survival model was fitted [37][38][39], with the baseline cumulative hazard function modelled using restricted cubic splines (with four knots deemed sufficient) and predictor effects (hazard ratios) assumed constant over time.A set of eight candidate predictors was considered at each cycle: age, tumour type, tumour grade, tumour size, number of positive nodes, menopausal status, adjuvant therapy, and hormone receptor status.
Backwards selection was used, with p>0.05 taken for exclusion.Separate but proportional baseline hazard functions were included for each country; that is, one study was taken as the reference group, and others were allowed a country-specific adjustment factor).When Internal-external cross-validation was undertaken for each strategy, and their predictive performance then summarised and compared across the eight external validation studies using meta-analysis.

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
15

Results
The predictor effect estimates (log hazard ratios) were similar in each cycle of the internalexternal cross-validation approach (results available on request).The backwards selection retained all candidate predictors in each cycle, apart from menopausal status which was always excluded.For each implementation strategy, we evaluated model performance in each external validation study by estimating Harrell's C-statistic [20], the D-statistic [22,40], and the calibration slope between the predicted hazard function and the observed hazard function, as defined in the Appendix.The estimates, with their variances and within-study correlation, are shown in supplementary material 4. Within-study correlations were all positive, and generally moderate to large.
Multivariate meta-analysis of the validation statistics is summarised in Table 3 for each implementation strategy.The summary C-statistic and D-statistic results are barely affected by the choice of strategy.The average C-statistic is 0.71 and its 95% prediction interval is 0.66 to 0.76, suggesting consistently moderate discrimination across populations.The average D-statistic is about 0.33, which equates to a moderate hazard ratio of 1.39 (95% CI: 1.23 to 1.57) between two equal sized groups across the prognostic index.However, D is inconsistent across populations (I 2 is about 87%), and thus its prediction interval is wide (Table 3).
Calibration slope is affected by the choice of strategy.For strategy (1), which allows recalibration in the validation study, the calibration slope is excellent.The meta-analysis gives an average calibration slope of 1.003, with only moderate heterogeneity (I 2 = 35%) leading to a narrow prediction interval of 0.93 to 1.08.In contrast, strategies (2) and (3) perform poorly.Although average calibration is excellent, there is large between-study heterogeneity (e.g.I 2 = 99% for strategy (3)) leading to wide predictions intervals (e.g.0.15 to 1.77 for strategy (3)).This again reveals how average performance is an incomplete and potentially misleading summary of performance.
Figure 3 shows joint prediction ellipses for the C-statistic and the calibration slope, derived using the multivariate meta-analysis results for each strategy.For implementation strategy (1) there is a joint probability of 0.67 for a C-statistic ≥ 0.7 and a calibration slope between 0.9 and 1.1; however, the probability is only 0.15 for strategy (3) and 0.22 for strategy (2).

M A N U S C R I P T A C C E P T E D
Strategy (1) thus performs best, but it requires recalibration of the model in new countries and may be difficult to implement.We therefore sought to improve strategy (2), which does not include recalibration, by identifying the cause of heterogeneity in its calibration performance.
It was observed that Study 3 gave the poorest calibration slope upon external validation of the models (Table 4), most likely due to the baseline hazard in Study 3 being different in shape (non-proportional) to those other studies.Extending equation ( 3) to a multivariate metaregression with a covariate for country (1= Study 3, 0 = otherwise) explained a large part of the heterogeneity (p < 0.001).We repeated the internal-external cross-validation approach for strategy (2), but omitted Study 3 for the entire process.External validation performance was improved, as heterogeneity in calibration slope was reduced (τ̂ = 0.156 excluding Study 3, τ̂ = 0.22 including Study 3), and thus its 95% prediction interval was narrower (supplementary material 5).The joint probability for a C-statistic≥0.7 and a calibration slope between 0.9 and 1.1 was improved to 0.32, but still considerably worse than strategy (1), indicating recalibration remains preferable.

Discussion
We have proposed a multivariate meta-analysis approach for summarising and comparing prediction model performance across multiple external validation studies using IPD.This can be used within internal-external cross-validation to also incorporate a model development phase, or when IPD from multiple studies are available for external validation of existing models.Each of the statistical methods involved (such as obtaining within-study correlations and fitting the multivariate equation) only take up to a few minutes to perform using computer software such as STATA, and provide results that improve the interrogation of a prediction model's performance and its implementation strategy.
Currently, most external validation research is undertaken using a single dataset.However, multivariate meta-analysis of IPD is a novel way to examine the overall performance and generalisability of a prediction model across multiple datasets [13,15,16].A good model will have satisfactory performance on average across all external validation datasets.But ideally there should also be little or no between-study heterogeneity in performance.Our examples showed that a prediction model may have excellent average performance, but may not have consistent performance across datasets.Such heterogeneity is rarely considered in external validation research, but should be routinely examined where possible, in particular to identify

M A N U S C R I P T A C C E P T E D
the best implementation strategy.In our examples, the investigation of heterogeneity revealed that recalibration of the intercept term to the validation population was essential, otherwise there was considerable inconsistency in calibration performance of our prediction models.
The importance of intercept recalibration is also shown elsewhere [41,42].However, it may not entirely remove the issue of miscalibration, as seen in the DVT example where there remained slight over-prediction even after recalibration.In particular, if there is also heterogeneity in predictor effects then one may also need to recalibrate these to the intended population; however, this defeats the purpose of the initial research (i.e. to develop a prediction model that can be used widely and easily), and rather indicates that additional and/or more homogenous predictors are required.
Heterogeneity in discrimination performance was also observed in our examples.This may also be due to heterogeneity in predictor effects across populations and/or different case-mix distributions across populations, as populations with wider ranges of continuous predictors often have better discrimination [15].For such reasons, incorporating matched case-control studies alongside cohort studies may increase heterogeneity in discrimination performance, as the former typically have narrower ranges of predictors [43].Another potential cause of heterogeneity in performance of a prognostic prediction model is follow-up time, and also heavy censoring may bias Harrell's C-statistic, prompting Gönen and Heller to propose an alternative [44].Such factors may also impact the magnitude of between-study correlation in the performance measures.
As external validation of a prediction model usually requires multiple statistical measures of performance, in particular at least one for calibration and one for discrimination [8,19], our multivariate meta-analysis approach jointly synthesises all measures together across multiple validation studies.This accounts for their within-study and between-study correlation [45], which may arise because measures are highly related [33].For example, the C-statistic and D-statistic typically have moderate to large positive within-study correlation (as seen in supplementary material 4(b)) as they are both measures of discrimination and within-studies are estimated on the same patients.Similarly the calibration slope and C-statistic may also be correlated between-studies, for instance if the between-study heterogeneity in predictor effects causes calibration slope to become greater than 1 as discrimination improves, but less than 1 as discrimination worsens.Accounting for such correlation in the meta-analysis allows the borrowing of strength across performance measures to potentially reduce bias and improve precision [46,47].Further, it is crucial to account for correlation when computing joint probabilities of model performance, such as the magnitude of the C-statistic and calibration slope, as otherwise inferences may be misleading [45].
Our intention was to illustrate how the multivariate meta-analysis approach allows researchers to summarise both discrimination and calibration.We focused on well-known statistical criteria, such as the calibration slope, (Harrell's) C-statistic, and Royston and Sauerbrei's D-statistic.However, we recognise that the criteria for a 'good' prediction model is open to much debate [48], and readers may prefer to meta-analyse other statistical measures available, including alternatives to Harrell's C-statistic [44].Clinical criteria may also be preferred [49], to focus more on the consequences for decision-making [50].Whatever criteria are used, we recommend they are pre-specified in a published protocol [51].Visual plots of calibration [23] and discrimination [52] are also important, as neatly illustrated by Royston et al. [17].Calibration estimates can also be obtained (and then meta-analysed) for particular subgroups within studies, for example defined by particular patient characteristics or categories of the prognostic index [23].Also, we note that excellent validation performance is not the end of the story: a prediction model's impact on patient outcomes also needs to be evaluated, for example in subsequent trials [3].
A potential limitation of our work is the multivariate normality assumption for the distribution of true performance across studies.Though this is a common assumption in the meta-analysis field, prediction intervals and regions are potentially vulnerable to departures from this [53].A related issue is the choice of scale to use for the estimates of validation performance [16], and further research is needed on this.Internal-external cross-validation is also limited if the number of studies are small, and researchers should ensure the number of events is suitable in each cycle [54,55].
In conclusion, we propose multivariate meta-analysis for external validation of the performance and implementation of a prediction model when IPD are available for multiple studies.The approach encourages researchers to focus not only on average performance, but also on the consistency in performance across populations, for both calibration and discrimination.* A trivariate meta-analysis was fitted to calibration-in-the-large, calibration slope and C-statistic, and then again for log(Expected/Observed), calibration slope, and C-statistic.Perfect negative correlation between calibration-in-the-large and expected/observed within studies prevents all four measures being analysed together (due to collinearity).Results were practically the same for calibration slope and C-statistic, regardless of the trivariate model fitted.* defined by a C-statistic≥0.7 and an calibration slope between 0.9 and 1.1

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
Appendix: Brief explanation of key statistical concepts considered in the article

C-statistic
A measure of a prediction model's discrimination (separation) between those with and without an event (outcome), which can be calculated for either binary or survival outcome data [20,21].Also known as the concordance index or, for binary outcomes, the area under the receiver operating characteristic (ROC) curve.It gives the probability that for any randomly selected pair of individuals, one with and one without the event (outcome), the model assigns a higher probability to the individual with the event (outcome).A value of 1 indicates the model has perfect discrimination, whilst a value of 0.5 indicates the model discriminates no better than chance.For the DVT model we used the 'roctab' module in STATA to calculate the C-statistic, whilst for the breast cancer model we used 'stcstat2' to calculate Harrell's C-statistic.

D-statistic
A measure of discrimination for time-to-event outcomes [22].This can be interpreted as the log hazard ratio comparing two equally sized groups defined by dichotomising at the median value of the prognostic index from the developed model (where the prognostic index is defined by the combined predictor effects in the developed model, i.e. beta1*X1 + beta2*X2 + ...).Higher values for the D-statistic indicate greater discrimination, and an increase of 0.1 over other risk scores is suggested to be a good indicator of improved prognostic separation [10].For the breast cancer model, the D-statistic in the external validation study was calculated using the 'str2d' module [40] in STATA after fitting the Royston-Parmar model in our development data using the 'stpm2' module [38].

Calibration slope
This is a measure of agreement between observed and predicted risk of the event (outcome) across the whole range of predicted values [1,20].For example, if a prediction model is developed using logistic regression it is of the form: is used to calculate the calibration slope (delta1), which should ideally be 1 (though a value of 1 does not by itself confirm calibration is perfect [48]).
In the breast cancer example in our article, we used a flexible parametric survival model to develop the prediction model, using the Royston-Parmar approach where the log cumulative hazard function is modelled using restricted cubic splines [37][38][39], with four knots chosen.To

Calibration-in-the-large
This is an overall measure of calibration, [1] defined by fitting in the validation dataset logit(p) = alpha0 + offset where alpha0 is the calibration-in-the-large and the offset term is equal to logit(P-pred)

Expected/Observed number of events (E/O)
This is another measure of calibration, closely related to the calibration-in-the-large, but more intuitive to interpret.It provides the ratio of the total expected events (outcomes) to the total observed events (outcomes).It can be obtained by summing all predicted probabilities in the validation dataset and dividing by the number of observed events.An ideal value is 1.Values less than 1 indicate the model is under-predicting the total number of events in the population, whilst values above 1 indicate it is over-predicting the total events in the population.

Bivariate model
When there are two performance measures of interest (e.g.C-statistic and calibration slope), the general multivariate meta-analysis of equation ( 3) can be simplified to the following bivariate meta-analysis: Supplementary material 2: Model parameter estimates for the fitted DVT model to be implemented using strategy (3), which fits a logistic regression model in each cycle of the internal-external cross-validation approach to obtain predictor effects and study-specific intercepts.
1 ) (β 2 ) (β 3 ) Note: Bold numbers represent the intercept used for external validation in the excluded study for strategy (3), where the intercept from the study with the closest prevalence was selected.

M A N U S C R I P T A C C E P T E D ACCEPTED MANUSCRIPT
Supplementary material 3(a): Estimates (standard errors) of the calibration and discrimination performance of the DVT model developed in each cycle of the internal-external cross-validation approach, for the three implementation strategies    The C-statistic and D-statistic only depend on the prognostic index (see Appendix).As the prognostic index (beta terms) from the developed model is not dependant on the implementation strategy, the C-statistic and D-statistic estimates are identical regardless of the implementation strategy used. 2 obtained as defined in the Appendix.

Strategy ( 2 )
: Use the estimated weighted average of the study intercept terms from the developed model.Strategy (3): Use the estimated intercept for one of the studies in the developed model that had the most similar prevalence of DVT to the external validation study.
)).During external validation of the model, for each implementation strategy four validation statistics were estimated: calibration-in-the-large, calibration slope, the Cstatistic, and the ratio of expected and observed DVT cases, as defined in the Appendix.These results are shown (with standard errors) in supplementary material 3(a) for each of the strategies.Their within-study correlations, obtained from bootstrapping with 1000 samples, are shown in supplementary material 3(b).These are large (between +0.90 to +0.98) for the calibration slope and C-statistic, indicating a strong positive relationship between them.In other words, as the observed calibration slope of model predictions decreases (becomes flatter) the observed discrimination of the model predictions also decreases (less separation); conversely, when model predictions produce a steeper observed calibration slope the discrimination is improved.The other measures of calibration (calibration-in-the-large and expected/observed) measure overall agreement, and thus are not affected so much by changes in discrimination; thus their within-study correlation with the C-statistic is close to zero.
applying the developed model to the external validation study, three different implementation strategies were considered (in regard the baseline hazard): Strategy (1): Use a new country-specific adjustment factor as estimated in the validation study itself.This is a form of recalibration, but assumes the baseline hazard in the validation study and the development studies are proportional.Strategy (2): Use a weighted average of the estimated country-specific adjustment factors from the developed model.Strategy (3): Use the country-specific adjustment factor for a country that was included in the developed model and is closest geographically to the validation country.
logit(p) = alpha + beta1*X1 + beta2*X2 + ... Then the predicted probability (P-pred) is derived using P-pred = exp(alpha + beta1*X1 + beta2*X2 + ...) / [1 + exp(alpha + beta1*X1 + beta2*X2 + ...)] data, fitting the model logit(p) = delta0 + delta1*(logit(P-pred)) estimate calibration slope of the developed model, we fitted the following model in each external validation study:lnH(t) = gamma0+gamma1*z1+gamma2*z2+gamma3*z3 +b*Xwhere b is the calibration slope, lnH(t) is the log cumulative hazard function over time, t, and the gamma and z terms define the knots and the baseline lnH(t).The (gamma1*z1+gamma2*z2+gamma3*z3) value was forced to be that from the developed model; the X value for each patient corresponds to their value of the prognostic index from the developed model (i.e.Beta1*X1 + Beta2*X2 + ...); and the gamma0 value is dependent on the implementation strategy used.In strategy (1) gamma0 is newly estimated in the external validation study (akin to recalibration).In strategy (2) gamma0 is forced to be a weighted (meta-analysis) average of all the study-specific gamma0 estimates from the developed model.In strategy (3) the gamma0 value is forced to be the gamma0 estimate for the nearest country available in the developed model.
-parametric bootstrapping to obtain variance-covariance matrix of performance estimates Consider a single external validation study.Bootstrapping uses the IPD from this study and randomly selects one patient with replacement, then randomly selects a second patient with replacement, and repeats until the same sample size is obtained as in the original study.This process is repeated b times, so that b bootstrap samples are obtained.Then, in each of the bootstrap samples the performance statistics of the developed model are estimated (e.g.Y i1 could be the estimated C-statistic and Y i2 could be the estimated calibration slope estimate in study i).This produces b values for each performance statistic.The observed variance of the b values for each statistic estimates their variance (i.e. it gives S i1 2 , the variance of Y i1 , etc).Similarly, the observed correlation across the b samples for each pair of validation statistics estimates their within-study correlation (i.e. it gives within-study correlation between Y i1 and Y i2 etc).
Estimates (standard errors) of calibration and discrimination performance for the breast cancer model in each cycle of the internal-external cross-validation approach, for the three implementations strategies

Table 1 :
Trivariate meta-analysis results* for the calibration and discrimination performance of the DVT model for each implementation strategy.

Table 2 :
Joint predicted probability of 'good' discrimination and calibration performance for each of the three DVT models, derived using the multivariate meta-analysis results for the Cstatistic and calibration slope shown in Table1.

Table 3 :
Trivariate random-effects meta-analysis results for calibration and discrimination performance of the breast cancer model for each implementation strategy Note: CITL refers to calibration-in-the-large and log(E/O) refers to log of the Expected/Observed number of events.Within-study correlations (ρ Wi ), obtained through bootstrapping, between performance statistics estimated for the DVT model in each cycle of the internal-external cross-validation approach, for the three implementation strategies CITL refers to calibration-in-the-large and log(E/O) refers to log of the Expected/Observed number of events.