Internal-external cross-validation helped to evaluate the generalizability of prediction models in large clustered datasets

Objective: To illustrate how to evaluate the need of complex strategies for developing generalizable prediction models in large clustered datasets. Study Design and Setting: We developed eight Cox regression models to estimate the risk of heart failure using a large population-level dataset. These models differed in the number of predictors, the functional form of the predictor effects (non-linear effects and interaction) and the estimation method (maximum likelihood and penalization). Internal-external cross-validation was used to evaluate the models’ generalizability across the included general practices. Results: Among 871,687 individuals from 225 general practices, 43,987 (5.5%) developed heart failure during a median follow-up time of 5.8 years. For discrimination, the simplest prediction model yielded a good concordance statistic, which was not much improved by adopting complex strategies. Between-practice heterogeneity in discrimination was similar in all models. For calibration, the simplest model performed satisfactorily. Although accounting for non-linear effects and interaction slightly improved the calibration slope, it also led to more heterogeneity in the observed/expected ratio. Similar results were found in a second case study involving patients with stroke. Conclusion: In large clustered datasets, prediction model studies may adopt internal-external cross-validation to evaluate the generalizability of competing models, and to identify promising modelling strategies.


Introduction
In medicine, there are an increasing number of clinical prediction models [1] . These models aim to predict a risk of having a certain condition or experiencing a health event in the future. Prediction models are often developed using a single and small dataset. This leads to prediction models that are more prone to overfitting with the dataset used for its development, which leads to poor accuracy and less generalizability of risk predictions when the model is validated or used in new individuals.
For this reason, there has been a growing interest in prediction model studies using large datasets from electronic health records (EHRs), multi-center studies or individual participant data [2][3][4][5] . An advantage of such large datasets is that parameters of the prediction model can accurately be estimated, thereby facilitating the development of complex models with many predictors, interaction terms and/or non-linear effects. Furthermore, a common feature of these large datasets is that individuals are often clustered within hospitals, primary care practices, or even within countries. Clusters may differ with respect to included participants, variable definitions and measurement methods, all of which may affect the generalizability of developed prediction models. The presence of clustering, however, also offers an important opportunity, as the performance of a prediction model can be examined on mul-tiple occasions and thus be used to explore its generalizability across different settings and populations. Recently, various strategies for such analyses using large clustered data have been proposed [2 , 5-8] .
The aim of this study was to illustrate how advanced methods can be used to evaluate the need of complex strategies for developing generalizable clinical prediction models in large clustered datasets.

Methods
For illustration purpose, we used two case studies.

Case study 1
We compared various modelling strategies using an example of a prediction model for the incidence of heart failure (HF). In the field of cardiovascular diseases (CVD), HF is one of the most relevant outcomes due to its high morbidity and mortality [9][10][11][12] .

Source of the data
We used an existing large population-level dataset which links three sources of EHRs in England: primary care records from the clinical practice research datalink (CPRD), secondary care diagnoses and procedures recorded during admissions in hospital episodes statistics (HES), and the cause-specific death registration information sourced from the office for national statistics (ONS) registry. This study was carried out as part of the CALIBER resource ( https://www.ucl.ac.uk/ healthinformatics/ caliber and https:// www.caliberresearch. org/) [13 , 14] . CALIBER, led from the UCL Institute of Health Informatics, is a research resource providing validated EHR phenotyping algorithms and tools for national structured data sources. Data were recorded in five controlled clinical terminologies: Read version 2 (CPRD diagnoses), International classification of diseases (ICD)-9 and ICD-10 (HES diagnoses, ONS causes of death), the Office of Population Censuses and Surveys (OPCS)-4 (HES procedures) and British National Formulary (BNF) (CPRD medication prescriptions). The study was approved by the MHRA (UK) Independent Scientific Advisory Committee (14_246RMnA2), under Section 251 (NHS Social Care Act 2006).

Population
The construction of this cohort has been described by Uijl et al [15] . Briefly, we selected all individuals that were 55 years or older between January 1, 2000 and March 25, 2010, and had at least 1 year of follow-up by a general practitioner, in a practice that had at least 1 year of upto-standard data recording in CPRD. The last date of the follow-up between the period above was considered cohort entry date (index date). Individuals with a history of HF before their index date were excluded. The study flow diagram is shown in Appendix A.

Predictors
We identified predictors that are commonly measured in CPRD or HES, and commonly used for prediction of HF [15 , 16] : age, sex, current smoking, ethnicity (CE, Caucasian ethnicity), index of multiple deprivation (IMD), body mass index (BMI), creatinine level (CL), and total cholesterol (TC). IMD is a measure of multiple deprivation at the small area level, consisting of seven domains [17] . Within this set, we selected those predictors which were least affected by missing data. The closest measurement to index date between 3 years before and 1 year after the index date was used. Detailed information about the definition of each predictor is available on the CALIBER website [18] .

Outcomes
The primary outcome was incidence of HF, based on the first record of HF from CPRD or HES after the index date. In CPRD, HF was defined by a diagnosis of HF or chronic left ventricular dysfunction on echocardiogram with READ codes. In HES, it was defined by a diagnosis of HF during a hospitalization using all positions of ICD-10. If no diagnosis of HF was made, censoring was defined as the first event among the following: death, deregistration from a practice, last practice data collection, or at the study end date.
2.1.5. Statistical analysis 2.1.5.1. Multilevel imputation. Multiple multilevel imputation, which accounts for potential heterogeneity between the included clusters, is recommended in the recent methodological guidelines [19,34] . However, due to limited hardware processing capacity, we applied single multilevel imputation. Details of the imputation process are described in Appendix B.

Derivation and validation of prediction models.
We considered eight modelling strategies to predict the risk of developing HF using Cox regression. These models differed with respect to the number of predictors, the functional form of the predictor effects and the method of estimation. Each model and their estimation method are summarized in Table 1 . Models 1, 3, 5 and 7 included all predictor variables as linear effects. Models 2, 4, 6 and 8 used RCS with three knots for all continuous predictor variables, and interaction terms between all possible combinations of two variables. Model 3, 4, 7, and 8 were estimated using a ridge penalty. For all models, the total number of regression coefficients is displayed.
Model 1 included four predictors (age, sex, current smoking, and CE) as linear effects. Model 2 was an extension of Model 1 that included non-linear effect for age and for all possible two-way interactions between the four predictors. Model 3 and 4 included the same predictors as Model 1 and 2, respectively, but were estimated using a ridge penalty. Model 5 was an extension of Model 1 that also included IMD, BMI, CL and TC as linear effects. Model 6 -8 were extended from Model 5 as similar to Model 2 -4 from Model 1. In models with a ridge penalty (Model 3, 4, 7 and 8), all regression coefficients were shrunk towards zero by penalizing the partial loglikelihood for the magnitude of the squared coefficients (L2-norm) [20] . This strategy has been recommended to avoid overfitting, and to improve prediction model performance, particularly when it is applied in new population. We used the degree of penalty (lambda) which minimized the mean square error in ten-fold cross validation. The proportional hazards assumption of all models was checked using the Schoenfeld residuals.
We performed internal-external cross-validation (IECV) to compare the performance of the aforementioned eight prediction models at multiple occasions [2 , 6] . In contrast to traditional internal validation methods (e.g., bootstrapping, cross-validation) which evaluate the model's performance in new individuals from the same population (i.e., reproducibility), IECV assesses model performance in new individuals from different but related practices as compared to the original development sample. These practices (i.e., taken as cluster) may differ with respect to case-mix, variable definitions and measurement methods, and thus allow to investigate the model's generalizability [21] . Using IECV, the data from all but one practice are used for estimating the prediction model, after which its performance is evaluated in the remaining practice. The procedure is repeated by rotating the omitted practice, resulting in multiple estimates of prediction model performance. For each prediction model, we assessed the model's discrimination performance using Harrell's concordance (c-) statistic. For calibration, we constructed calibration plots in the overall population. We also estimated the calibration slope and the ratio of observed versus expected events (O:E ratio) at 5 years of follow-up [22] . Interpretation of each performance measure is described in Appendix C.
The performance measures resulting from IECV were pooled using random-effect meta-analysis [2 , 23 , 24] . This approach not only accounts for the precision of practicespecific performance estimates, but also quantifies the between-practice variability (heterogeneity) of model performance. Heterogeneity is quantified by the betweenpractice standard deviation of model performance ( τ ) [7] . Meta-analysis results were reported as point estimates with 95% confidence intervals (CI) and 95% prediction intervals (PI). The CI indicates the precision of the model's average performance across all practices. Conversely, the PI accounts for heterogeneity between practices and therefore indicates what performance can be expected when the model is applied within a specific practice.
Abbreviations: IT, interaction terms; #RC, the total number of regression coefficients; CE, Caucasian ethnicity; IMD, index of multiple deprivation; BMI, body mass index; CL, creatinine level; TC, total cholesterol; L, Linear effects; RCS, restricted cubic splines.

Case study 2
In this case study, we used patient-level data from a large international, multi-center, randomized controlled trial [25] . Because the missingness proportion was very low (6.0%), we performed a complete case analysis. Eight modelling strategies using ridge penalized Cox regression model were considered to predict the risk of mortality from CVD in patients with acute ischemic stroke. These models differed with respect to the number of predictors, the functional form of the predictor effects (non-linear effects and/or interaction terms). We illustrated the advantage of IECV by comparing it with bootstrap internal validation. More detailed information is available in Appendix D.
All analyses were performed using R version 3.6.1.

Case study 1
The cohort included 871,687 individuals from 225 general practices. Among these, 43,987 (5.5%) developed HF during a median follow-up time of 5.8 years (interquartile range [IQR] 2.7 -9.9), with a median time-to-event of 3.7 years (IQR 1.8 -6.4). Baseline characteristics are shown in Table 2 .
The number of patients with HF in each general practice was a median of 197 (IQR 128 -282, range 3 -622). We explored heterogeneity of case-mix across the included general practices by comparing their distribution of predicted risk according to Model 5. Results in Appendix E indicate that the standard deviation (SD) of the linear predictor (LP) in each general practice ranged between 1.09 and 1.41, and that the mean LP in each general practice ranged between -0.51 and 0.61.
The estimated regression coefficients of the eight prediction models, as obtained from the entire dataset, are presented in Appendix F. These results indicate that all included predictors were significantly associated with HF, and that interactions were present between various predictors. The performance of the estimated models, as evaluated using IECV, is summarized in Table 3 .
For all models, summary estimates were obtained using random effects meta-analysis. The SE and betweenstudy heterogeneity ( τ ) are given on the scale of the metaanalysis (that is, the logit of c-statistic, the log of the O:E ratio and identity for the calibration slope).

Discrimination performance
The c-statistic across the general practices is shown in Appendix G. All models showed similar discrimination, al- though models that included more predictors yielded somewhat larger values for the c-statistic (0.79 in Model 1 -4 vs. 0.81 in . For all models, there was notable between-practice heterogeneity in discrimination performance. For instance, the 95% PI for a Cox regression model including eight predictors as main effects (model 5) ranged from 0.756 to 0.852. Estimates for the betweenstudy standard deviation ( τ ) were similar for all models, but slightly larger for prediction models that included eight predictors and allowed for non-linear effects and interactions. Fig. 1 indicate that predicted and observed risks were almost in perfect agreement for the unpenalized Cox regression model that included non-linear effects and interactions between predictors (Model 2 and 6).

Calibration performance Calibration plot. Calibration plots in
Predicted and observed risks are almost in perfect agreement for the unpenalized Cox regression models that in-cluded non-linear effects and interactions between predictors (Model 2 and 6). Some under-prediction for risk estimates around 10% is observed in the remaining models.
O:E ratio. The O:E ratio across the included general practices is shown in Appendix H. All models yielded summary O:E ratios at 5 years below one, especially those models that included eight predictors (Model 5 -8). In addition, PIs indicate that all prediction models may substantially over-or under-predict the risk of HF when applied to individual patients from a new practice.
Calibration slope. Calibration slope across the included general practices is shown in Appendix I. Unpenalized prediction models yielded pooled calibration slopes most close to one (Model 1, 2, 5, and 6). Prediction models that adopted a ridge penalty yielded calibration slopes that were slightly larger than one, indicating that predicted risks did not vary enough and thus that too much shrinkage may have been applied in the development sample. For all models, the calibration slope was prone to a limited amount of between-practice heterogeneity. For instance, the prediction model that included eight predictors as main effects (model 5) yielded a 95% PI from 0.833 to 1.214. Estimates of between-study variance of the calibration slope were similar for all models.
Case study 2 The detailed results are shown in Appendix D. In short, among 16,280 patients from 14 countries, 2,745 (16.9%) died due to any CVD related conditions. Using bootstrap validation, we found that the c-statistic ranged from 0.65 to 0.70 with good reproducibility, and that models with more predictors discriminated better. However, results of IECV indicate that the inclusion of additional predictors increased the heterogeneity in discrimination performance. Results of both bootstrap validation and IECV also indicate that inclusion of non-linear terms and/or interaction effects) did not improve discrimination performance. In calibration performance, the effect of complex modelling strategies was small in both summary estimates of O:E ratio and calibration slope and their generalizability.

Discussion
We illustrated how evidence synthesis methods can be used to evaluate the need of complex strategies for developing generalizable clinical prediction models in large clustered datasets. To this end, we applied IECV and quantified the model's average performance as well as its variability between clusters. In contrast to traditional internal validation methods, a major advantage of using IECV in large clustered data is that the external validity of prediction models can be assessed on multiple occasions, thereby allowing researchers to explore the generalizability of different modelling strategies directly during the development process.
In the case study 1, we found that adopting complex modelling strategies did not much improve the external validity of developed prediction models for HF. In particular, prediction models that were based on four commonly available variables yielded a c-statistic of 0.79, which is comparable to existing models for HF using even more than 10 predictors including laboratory tests [10 , 11] . Although the inclusion of additional predictors marginally improved the discriminative performance, it also slightly increased the between-practice heterogeneity. When investigating model calibration, we found that all prediction modelling strategies yielded adequate calibration performance on average. However, because of between-practice heterogeneity, local revisions were often deemed necessary. In the case study 2, we also found that complex modelling did not meaningfully improve the generalizability of the prediction models, although the inclusion of additional predictors moderately improved their discrimination performance.
As we found in the case study 1, the incremental value of candidate predictors is often small in prediction model studies for the incidence of CVD [26 , 27] . For instance, systematic reviews have demonstrated a lack of incremental value for cholesterol level [27] , BMI [27] , and even biomarkers (e.g., triglycerides, C-reactive protein) for predicting CVD [26] . For this reason, it may sometimes be more advantageous to consider the inclusion of non-linear effects or interaction terms, rather than adding more predictors. This strategy is common in machine learning, where methods no longer assume additive linear effects and adopt penalization to avoid overfitting. We mimicked the use of flexible modelling strategies by including non-linear effects and non-linear interaction terms. However, this strategy also failed to improve model discrimination. Similar findings also have been reported in prediction model studies for the prognosis of patients with CVD [28 , 29] . For instance, a recent study adopting advanced machine learning algorithms failed to outperform traditional pre-diction models for readmissions in patients with HF, and yielded c-statistics around 0.60 [28] . In another study, discrimination performance to predict all-cause mortality in patients with coronary artery disease marginally increased from 0.793 (Cox regression model with 27 predictors) to 0.797 (random survival forests with 98 predictors) and to 0.801 (elastic net Cox regression model with 586 predictors) [29] . More generally, there is limited evidence that machine learning models can outperform simple prediction models involving additive linear terms, especially when predictions are only based on structured epidemiological data [30] .
The following limitations need to be considered. In the first case study, the substantial presence of missing data was an important concern. Although we focused on the inclusion of variables with relatively few missing values, some were missing for more than 70% of participants. Multiple imputation is generally recommended to obtain reliable standard errors of the performance measures but only single imputation was pursued due to limited hardware processing capacity. There is still limited guidance on how to implement multiple imputation when developing and validating a prediction model in large clustered datasets. Key issues that remain unclear are (i) how to combine multiple imputation with sampling procedures (e.g., IECV) [31 , 32] , (ii) the order of pooling estimates (across imputations or across clusters first) [33] . Another limitation was that we were not able to include non-linear and interaction terms in the imputation model due to nonconvergence issues. Therefore, continuous variables were imputed as a linear term and no interaction term was included in imputation models. This strategy may have favored simpler modelling strategies in IECV. For this reason, we implemented those modelling strategies in the case study 2 where the presence of missing data was much less a concern. Here, we found similar findings to those in the case study 1.
Second, eligible individuals in both case studies were enrolled more than ten years ago. It is possible that population characteristics have substantially changed over time, and that complex associations (e.g., non-linear predictor effects or interaction terms) have become more common.
Third, we focused on regression-based methods and did not evaluate other flexible modelling strategies such as neural networks or random forests. It is possible that these strategies could yield more promising results, especially if (interaction between) predictor effects cannot adequately be described using the regression-based methods considered here.

Conclusion
We recommend the use of IECV in large clustered datasets to assess the generalizability of prediction models during their development, and to identify whether complex modelling strategies may offer any advantages. In contrast to traditional internal validation methods, IECV allows to evaluate model performance in non-random hold-out samples with individuals from different settings or populations. In our case studies, we found that accurate prediction does not necessarily require complex modelling strategies, and that the need for local updating may be inevitable regardless of how much data are at hand during the model's development. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.