Predicting COVID-19 prognosis in the ICU remained challenging: external validation in a multinational regional cohort

Objectives: Many prediction models for coronavirus disease 2019 (COVID-19) have been developed. External validation is mandatory before implementation in the intensive care unit (ICU). We selected and validated prognostic models in the Euregio Intensive Care COVID (EICC) cohort.
Study Design and Setting: In this multinational cohort study, routine data from COVID-19 patients admitted to ICUs within the Euregio Meuse-Rhine were collected from March to August 2020. COVID-19 models were selected based on model type, predictors, outcomes, and reporting; in addition, general ICU scores were assessed. Discrimination was assessed by the area under the receiver operating characteristic curve (AUC) and calibration by calibration-in-the-large and calibration plots. A random-effects meta-analysis was used to pool results.
Results: In total, 551 patients were admitted. Mean age was 65.4 ± 11.2 years, 29% were female, and ICU mortality was 36%. Nine of 238 published models were externally validated. Pooled AUCs ranged from 0.53 to 0.70 and calibration-in-the-large from −9% to 6%. Calibration plots showed generally poor calibration, although calibration was moderate for the 4C Mortality score and the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC) score.
Conclusion: Of the nine prognostic models externally validated in the EICC cohort, only two showed reasonable discrimination and moderate calibration. For future pandemics, better models based on routine data are needed to support admission decision-making.


Introduction
During the coronavirus disease 2019 (COVID-19) pandemic, many prediction models were developed for diagnostic and prognostic purposes. Accurate prediction was paramount to support clinical decision-making, particularly during the early phase of the pandemic, when little was known about the manifestations of the disease caused by the new severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Furthermore, prediction of patient outcomes can improve the management of bed availability during a pandemic, when knowledge and capacity are under pressure. This was especially the case in the intensive care unit (ICU), as many patients with severe SARS-CoV-2 infection required organ support there [1,2].
A prediction model needs to meet several criteria to be useful in daily clinical practice. In the third update of the living systematic review by Wynants et al. [3], 238 prediction models for prognosis and diagnosis in COVID-19 were identified and assessed for risk of bias. The risk of bias of all included models was evaluated as high or, at best, unclear. For a model to perform well, both discrimination and calibration are important. In addition, model predictors must be routinely available. Furthermore, models need to be applicable to the population and settings requiring prediction, such as prognosis in the ICU, particularly during scarce bed availability. However, external validation of prediction models, that is, testing the model in a sample of patients other than the one in which it was developed, is often omitted, particularly in the ICU [4]. External validation is essential to generalize results to future patients and should precede the implementation of models in daily clinical practice [5,6]. Several external validation studies of prediction models for COVID-19 patients have been conducted. However, these studies focused mostly on patients admitted to the hospital ward rather than the ICU [7-9]. There is still a lack of ICU-specific prediction models, and general models are likely applicable to the ICU population for only some models [3,10].
Therefore, we aimed to evaluate the predictive performance of published prediction models by selecting promising prognostic prediction models with clinically available predictors for external validation in our multinational COVID-19 cohort consisting of patients admitted to the ICUs within the Euregio Meuse-Rhine. As the majority of the 238 evaluated prediction models were developed at the beginning of the pandemic, we used data from the first pandemic wave for external validation.

Materials and methods
The paper is reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis clustered data (TRIPOD-Cluster) reporting guideline [11-14]. Every section of the Materials and Methods is detailed in Appendix A.2.

Model selection
Prognostic prediction models for COVID-19 patients in the ICU were identified and extracted from https://www.covprecise.org/, the international Precise Risk Estimation to optimise COVID-19 Care for Infected or Suspected patients in diverse sEttings (COVID-PRECISE) group, in collaboration with the Cochrane Prognosis Methods Group, according to the living systematic review of Wynants et al. (Fig. 1) [3]. Inclusion and exclusion criteria are described in Appendix A.2.1, and the selection process is shown in Fig. 1.

Key findings
Of 238 reviewed prognostic prediction models, nine were externally validated in the ICU.
Only two out of these nine models showed reasonable discrimination and moderate calibration.
What this adds to what was known?
External validation of prediction models is often omitted in the ICU.
Although great efforts were made to develop prediction models early in the pandemic, their clinical value to support decision-making in the ICU is, overall, poor.

What is the implication and what should change now?
For future pandemics, better prediction models based on routine data are needed to support admission decision-making.

External validation cohort
All patients with COVID-19 confirmed by polymerase chain reaction and/or chest computed tomography scan and with respiratory failure who were admitted to the ICU of any of the seven participating Euregio hospitals were consecutively included between March 2, 2020, and August 12, 2020 (Fig. 2) [17]. Hence, the study sample size was determined pragmatically. An extensive description of our methods and cohort is provided in Appendix A.2.2 and elsewhere [16,18].

Predictors
Using a predefined study protocol [16,18], predictor data up to 24 hours after ICU admission were acquired from electronic medical records and collected manually or electronically, depending on the center. The collected variables used as predictors and outcomes are described in A.2.3 and Table A.1 of the Appendix [19]. Unknown, inappropriate, and inapplicable data were considered missing at random, since missingness was related to other variables in the dataset and unlikely to be related to the true value itself [20-22].

Outcomes
Follow-up ended when patients were either discharged from the ICU or died in the ICU; the outcome was therefore ICU discharge or ICU death. Patients whose outcome status after transfer could not be retrieved, even after recontacting the hospital, were censored (Appendix A.2.4). Sensitivity analyses were performed without censored patients.

Description of included prediction models
The study characteristics of the included prediction models and their risk of bias are described in more detail in Appendix A.2.5 [23-30]. The risk of bias of the individual studies was scored by Wynants et al. [3] using the Prediction model study Risk Of Bias Assessment Tool (PROBAST) [15].

Ethics approval
Ethical approval was obtained from the medical ethics committee (Medisch Ethische Toetsingscommissie 2020-1565/3 00 523) of Maastricht UMC+.

Statistical analysis
We included all patients in the analyses. In addition, sensitivity analyses were performed without censored transferred patients, who contribute to the survived group in the main analysis. Missing data were imputed using multiple imputation if ≤50% of values on a variable were missing. Variables with more missing values were omitted from the analysis. The number of imputations was based on the percentage of patients with missing data [31]. Continuous and categorical predictors were handled using the same definitions and cutoff values as in the development study. The prognostic index was calculated for each patient as the sum of the model's regression coefficients, reported in the development studies, multiplied by the individual patient values. The prognostic index was transformed into a probability score when a model intercept was reported. For the Sequential Organ Failure Assessment (SOFA) score and the Acute Physiology And Chronic Health Evaluation II (APACHE II) score, risk scores instead of separate variables were already available for all patients and were therefore assessed directly. The performance of the models was assessed by both discrimination and calibration measures. Model discrimination, the ability to separate patients who died in the ICU from those who were discharged, was determined as the area under the receiver operating characteristic (ROC) curve (AUC). An AUC of 0.5 implies inability to distinguish between those who die in the ICU and those who are discharged, whereas an AUC of 1 means perfect discrimination. Model calibration refers to the agreement between the observed risk and the predicted risk [32,33]. Calibration was assessed by calibration-in-the-large (i.e., the difference between the predicted and observed probability of mortality) and by visual inspection of the calibration plot.
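To illustrate the computations described above, the following is a minimal sketch in Python. This is not the authors' code; the function names are ours, and the logistic transform for converting a prognostic index to a probability is an assumption (it applies only to logistic-regression-based models with a reported intercept).

```python
import math

def prognostic_index(coefs, values):
    """Sum of the development-study regression coefficients
    multiplied by the individual patient values."""
    return sum(b * x for b, x in zip(coefs, values))

def predicted_probability(intercept, coefs, values):
    """Assumed logistic transform; only possible when the
    development study reported a model intercept."""
    lp = intercept + prognostic_index(coefs, values)
    return 1.0 / (1.0 + math.exp(-lp))

def auc(scores, outcomes):
    """Rank-based AUC: the probability that a patient who died
    (outcome 1) scored higher than one who was discharged
    (outcome 0); ties count half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_in_the_large(probs, outcomes):
    """Mean predicted risk minus observed mortality;
    positive values indicate overestimation."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)
```

An AUC of 0.5 from `auc` corresponds to chance-level discrimination, and a `calibration_in_the_large` of 0 to perfect calibration on average, matching the definitions above.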
Calibration could only be assessed in models that reported an intercept to calculate a probability instead of a unitless risk score only. The cohort was divided into deciles according to the estimated probability score, displayed by points in the calibration plot. Perfect calibration is shown by the diagonal reference line, indicating agreement between predicted and observed probabilities over the range of predictions. Points above the reference line indicate underestimation by the model, whereas points below the reference line reflect overestimation. Pooled AUCs and calibration-in-the-large were calculated for the three Euregio country parts using random-effects meta-analysis, and 95% confidence intervals (CIs) were computed [12,13].

Fig. 1. Flowchart identifying prediction models. COVID-19, coronavirus disease 2019; ICU, intensive care unit; ARDS, acute respiratory distress syndrome; ASAT, aspartate aminotransferase; PROBAST, Prediction model study Risk Of Bias Assessment Tool; EICC, Euregio Intensive Care COVID; COVID-PRECISE, Precise Risk Estimation to optimise COVID-19 Care for Infected or Suspected patients in diverse settings. Legend: models for diagnosis and for identifying people at risk in the general population were excluded. The remaining models were mainly prognostic, and further selection was based on outcome measures. As our cohort was composed of ICU patients only, in whom severe COVID-19 infection can be assumed, the outcomes ICU admission, progression to severe COVID-19, severe COVID-19, and ARDS were excluded. The outcome measures length of hospital stay, in-hospital mortality, and in-hospital or out-of-hospital mortality were used. Since reporting of predictors and coefficients is necessary to validate prediction models, as specifically assessed in step 4.9 of PROBAST [15], a tool to assess the risk of bias and applicability of prediction model studies, models that did not (or probably did not) report these, or that were machine learning or artificial intelligence studies, were excluded. Finally, the predictors included in each of the final 21 prediction models were evaluated. Again, as we only included ICU patients and our goal was to validate models containing routinely available data, models including symptoms not relevant for ICU patients, data not routinely available, or data not available in the EICC cohort (e.g., ≥50% missing data) were excluded. Additionally, two promising models that were not available in COVID-PRECISE were added.
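The decile grouping for the calibration plots and the pooling across country parts could be sketched as follows. This is a hypothetical illustration: the paper specifies random-effects meta-analysis [12,13] but not the estimator, so the DerSimonian-Laird estimator used here is an assumption.

```python
def decile_calibration_points(probs, outcomes):
    """Divide patients into deciles of predicted risk; return
    (mean predicted risk, observed mortality) per decile, the
    points displayed in a calibration plot."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    points = []
    for d in range(10):
        idx = order[d * n // 10:(d + 1) * n // 10]
        if not idx:
            continue
        points.append((sum(probs[i] for i in idx) / len(idx),
                       sum(outcomes[i] for i in idx) / len(idx)))
    return points

def pool_random_effects(estimates, variances):
    """Random-effects pooling of per-country estimates (e.g., AUCs)
    with a DerSimonian-Laird between-study variance (assumed
    estimator); returns the pooled estimate and its 95% CI."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, estimates))
    k = len(estimates)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, estimates)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se
```

A point from `decile_calibration_points` lying above the diagonal (observed greater than predicted) corresponds to underestimation by the model, as described above.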

Results

Model selection
A total of 238 prediction models for COVID-19 were identified by COVID-PRECISE. First, 129 models were excluded because they were diagnostic or not applicable to the ICU population (Fig. 1). Subsequently, 45 models were excluded due to unusable outcome measures such as ICU admission or severe COVID-19 pneumonia. Forty-three models were excluded because full information on predictors, intercepts, and coefficients was not present in the original article or supplement. Of the 21 potential prognostic models, three were not applicable since some predictors were not relevant for the ICU (e.g., cough, fatigue), four models included predictors that were not routinely available in Euregio ICUs (e.g., interleukin 6 or procalcitonin), and seven were excluded because they contained predictors that were more than 50% missing in our cohort. The APACHE II model [26] is widely used in the ICU and was added as a prognostic model. The SOFA [30] and Confusion, Urea >7 mmol/L, Respiratory rate ≥30/min, low systolic (≤90 mmHg) or diastolic (≤60 mmHg) Blood pressure, age ≥65 years (CURB-65) score [29], models that are also broadly implemented, were already included in the models selected via COVID-PRECISE. Furthermore, the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC) score [27], which was applicable to the Euregio Intensive Care COVID (EICC) cohort but was not available in COVID-PRECISE, was investigated. Thus, nine potential prognostic prediction models were selected for external validation. One model had an unclear risk of bias, five had a high risk of bias, and three models comprised already established prediction scores (Fig. 1 and Table 1).

External validation cohort
From March 2, 2020, to August 12, 2020, 551 patients with COVID-19 pneumonia were admitted to seven ICUs across the Netherlands, Belgium, and Germany (Fig. 2). Demographic and clinical characteristics and outcome measures are reported in Table 2 for the full EICC cohort and in Table A.2 (Appendix) for the individual country parts. The mean age of the cohort was 65.4 ± 11.2 years, the mean body mass index was 29.0 ± 5.3 kg/m², and 29% were female. At ICU admission, disease severity, as defined by the APACHE II and SOFA scores, was 16.1 ± 5.5 and 6.2 ± 3.0, respectively.

Predictors
In our dataset, 309 (56%) of the patients had at least one missing value on any of the variables from the full set of predictors. Therefore, the number of imputations of the multiple imputation model was set to 56.
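This choice can be sketched as follows, assuming the common rule of thumb (cited as [31] in the Methods) that the number of imputations should roughly equal the percentage of patients with incomplete data; the function name is ours.

```python
def n_imputations(n_incomplete, n_total):
    """Rule of thumb: set the number of multiple imputations to the
    (rounded) percentage of patients with at least one missing value."""
    return round(100 * n_incomplete / n_total)

# With 309 of 551 patients having a missing value, as in this cohort,
# this yields 56 imputations.
```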
Pooled calibration-in-the-large was −2% (95% CI −14% to 10%) for the DL-death model, 6% (95% CI −6% to 18%) for the DCSL-death model, and −5% (95% CI −20% to 11%) for the SEIMC model (Table 3). Fig. 3 shows calibration plots for the DL-death, DCSL-death, and SEIMC models. Similar results were observed in sensitivity analyses (Table A.3 and Fig. A.1, Appendix). Minor differences in model discrimination existed between the three Euregio country parts, with the DL-death and DCSL-death models having the lowest AUC in the Belgian part, whereas for the clinical model, the mechanistic COVID-19 mortality score, and the SEIMC score, the lowest AUCs were observed in the German part (Table A.4, Appendix). Calibration-in-the-large, however, varied largely between the individual countries (Table A.4, Appendix).

Table 1 abbreviations and footnotes: ICU, intensive care unit; COVID-19, coronavirus disease 2019; RT-PCR, reverse transcription-polymerase chain reaction; CAP, community-acquired pneumonia; CRP, C-reactive protein; SpO2, peripheral capillary oxygen saturation; eGFR, estimated glomerular filtration rate; CKD-EPI, Chronic Kidney Disease Epidemiology Collaboration; FiO2, fraction of inspired oxygen; PaO2/FiO2 ratio, the ratio of partial pressure of oxygen in arterial blood divided by the fraction of inspired oxygen. (a) We only included models having mortality as the outcome. (b) One point was scored if systolic blood pressure was ≤90 mmHg or diastolic blood pressure was ≤60 mmHg.
Pooled calibration-in-the-large was −9% (95% CI −21% to 3%) for the APACHE II score, and the calibration plot is shown in Fig. 3. Similar model performance was observed in sensitivity analyses (Table A.3 and Fig. A.1, Appendix). However, the German part had a lower AUC than the Belgian and Dutch Euregio parts, whereas calibration-in-the-large was best in the Belgian part (Table A.4, Appendix).

Discussion
In this study, we reviewed 238 prognostic prediction models for COVID-19 and externally validated nine using routinely available data in a multinational cohort of COVID-19 patients admitted to seven ICUs in Belgium, the Netherlands, and Germany during the first pandemic wave. In addition, established ICU prediction models were added for external validation in COVID-19 patients. Most studied models, including COVID-19 prediction models rated as high risk of bias as well as established ICU scores, revealed poor performance regarding both discrimination and calibration. However, the 4C Mortality score and the SEIMC score showed reasonable model performance after external validation in an ICU cohort. Taken together, this shows that, despite the huge effort to develop many models early in the pandemic, their clinical value to support decision-making is, overall, poor. This highlights that data infrastructure for high-quality studies on model development, external validation, and implementation is required to improve data-driven decision support in future pandemics [34].
A direct comparison of model performance is hampered by case-mix differences between the model development populations and the EICC cohort. These case-mix differences, as well as possible explanations for the observed model performance, are extensively described in A.4 of the Appendix. Except for the APACHE II score and SOFA score, the included models were developed and/or validated in hospitalized patients or outpatients, with none, or only a small subset, of the cohort admitted to the ICU. All patients included in the EICC cohort, on the contrary, were admitted to the ICU, indicating more severe illness and/or a more advanced disease course. Furthermore, in the ICU, patient selection likely played a role, as patients with a high age and a high burden of comorbidities were often excluded from ICU admission. The EICC cohort therefore reflects a relatively homogeneous case-mix compared with model development studies on the hospital ward or in the general population, as patients at the highest risk, who were not accepted for ICU admission, and at the lowest risk, who did not require intensive organ support, were likely not included. However, considerable heterogeneity was observed in the EICC cohort [16], also illustrated by differences in model performance between the Euregio country parts. Since discriminatory performance depends on case-mix variability, models developed or validated in hospitalized or outpatient populations showed lower AUCs after external validation in our relatively homogeneous ICU cohort [32,33]. Previous validation studies evaluating prediction models in other cohorts often included general populations, explaining why higher AUCs were observed than in the EICC cohort. It is therefore inappropriate to directly compare AUCs from validation studies in a general population to those in the ICU population.
Nevertheless, high-quality prediction models could support a multifactorial decision when stress on ICU bed availability increases during a pandemic, particularly when driven by an intervening national healthcare policy [16,35].

Limitations
We evaluated nine prognostic models, comprising only one model at unclear risk of bias, five models at high risk of bias, and three established models, all with moderate to poor performance, which indicates that there is still a lack of well-performing and valid prediction models for the ICU population. However, we could not evaluate all high-risk-of-bias prediction models, as data on certain variables were missing, which excluded these models. Our analyses cannot provide evidence that other high-risk-of-bias models should be discouraged, although, as a proof of concept, our study may warrant caution at the very least. Furthermore, we externally validated the APACHE II score instead of the more recent and advanced APACHE IV score [36], as data for the APACHE II score were more complete. Another limitation was the lack of information after transfer to another ICU for 25 patients. However, sensitivity analyses without these patients showed comparable results. In addition, the original articles of certain models did not report an intercept, and calibration could therefore not be assessed for those models. The included COVID-19 prediction models were developed in the early phase of the pandemic and externally validated using patient data from the first pandemic wave. The dynamic evolution of the virus was not considered and, therefore, our results cannot be generalized to ICU patients admitted later in the pandemic with other SARS-CoV-2 variants. However, first-wave data were used because the stress on healthcare systems, and the accompanying need for prediction, was highest during that period. As considerable heterogeneity is observed between SARS-CoV-2 variants and pandemic waves, models should be externally validated or updated in cohorts from other pandemic waves [37,38]. Model updating and extension, which have not yet been performed, could further improve model performance [32,33].
Our study, therefore, sets the stage for model updating and extension of the promising 4C Mortality score and SEIMC model.

Table 3 footnotes: (a) Discrimination is reported as the pooled area under the receiver operating characteristic (ROC) curve with 95% confidence interval (CI) across all 56 imputed sets, using random-effects meta-analysis. (b) Calibration-in-the-large is reported as the pooled difference between the predicted and observed mortality risk with 95% CI across all 56 imputed sets, using random-effects meta-analysis; positive values suggest overestimation, whereas negative values suggest underestimation. (c) Intercept not reported, or risk score only.

Conclusions
In this study, nine out of 238 available COVID-19 prognostic models were externally validated in the EICC cohort based on routinely collected data. Only two of these nine models, the 4C Mortality score and the SEIMC score, showed reasonable discrimination and moderate calibration. For future pandemics, better prediction models based on routine data are essential to improve data-driven decision support. Therefore, data infrastructure for high-quality studies on model development and external validation is required.