Abstract
Background and Objectives
Methods
Results
Conclusion
Systematic review registration
Keywords
Key findings
- Design and methodological conduct of studies on clinical prediction models based on machine learning vary substantially.
What this adds to what was known?
- Studies on clinical prediction models based on machine learning suffered from poor methodology and reporting similar to studies using regression approaches.
What is the implication and what should change now?
- Methodologies for model development and validation should be more carefully designed and reported to avoid research waste.
- More attention is needed to missing data, internal validation procedures, and calibration.
- Methodological guidance for studies on prediction models based on machine learning techniques is urgently needed.
1. Introduction
2. Methods
2.1 Eligibility criteria
2.2 Screening and selection process
2.3 Extraction of data items
2.4 Summary measures and synthesis of results
3. Results

Key characteristics | Total (n = 152), n (%) [95% CI] |
---|---|
Study aim | |
Diagnosis | 58 (38.2) [30.8–46.1] |
Prognosis | 94 (61.8) [53.9–69.2] |
Study type | |
Model development only | 133 (87.5) [81.3–91.8] |
Model development with external validation | 19 (12.5) [8.2–18.7] |
Outcome aim | |
Classification | 120 (78.9) [71.8–84.7] |
Risk probabilities | 32 (21.0) [15.3–28.2] |
Setting | |
General population | 17 (11.2) [7.1–17.2] |
Primary care | 15 (9.9) [6.1–15.6] |
Secondary care | 32 (21.1) [15.3–28.2] |
Tertiary care | 78 (51.3) [43.4–59.1] |
Unclear | 13 (8.6) [5.1–14.1] |
Outcome format | |
Continuous | 7 (4.6) [2.2–9.2] |
Binary | 131 (86.2) [79.8–90.8] |
Multinomial | 7 (4.6) [2.2–9.2] |
Ordinal | 2 (1.3) [0.4–4.7] |
Time-to-event | 3 (2.0) [0.7–5.6] |
Count | 2 (1.3) [0.4–4.7] |
Type of outcome | |
Death | 21 (13.8) [9.2–20.2] |
Complications | 65 (42.8) [35.2–50.7] |
Disease detection | 30 (19.7) [14.2–26.8] |
Disease recurrence | 9 (5.9) [3.1–10.9] |
Survival | 3 (2.0) [0.7–5.6] |
Readmission | 4 (2.6) [1–6.6] |
Other | 20 (13.2) [8.7–19.5] |
Mentioning of reporting guidelines | |
TRIPOD | 8 (5.3) [2.7–10] |
STROBE | 3 (2.0) [0.7–5.6] |
Other | 5 (3.3) [1.4–7.5] |
None | 139 (91.4) [85.9–94.9] |
Model availability | |
Repository for data | 18 (11.8) [7.6–17.9] |
Repository for code | 13 (8.6) [5.1–14.1] |
Model presentation | 31 (20.4) [14.8–27.5] |
None | 121 (79.6) [72.5–85.2] |
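The proportions above are reported with 95% confidence intervals. As a minimal illustration, assuming the Wilson score method (which reproduces several of the intervals shown), the interval for the 58 of 152 diagnostic studies can be computed as below; this is a sketch, not the review's analysis script.

```python
# Sketch: 95% confidence interval for a reported proportion.
# Assumes the Wilson score method; 58/152 is the "Diagnosis" row above.
from statsmodels.stats.proportion import proportion_confint

count, n = 58, 152
lower, upper = proportion_confint(count, n, alpha=0.05, method="wilson")
print(f"{count}/{n} = {100 * count / n:.1f}% "
      f"[{100 * lower:.1f}-{100 * upper:.1f}]")  # ~38.2% [30.8-46.1]
```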

Modelling algorithm | All extracted models (n = 522), n (%) [95% CI] |
---|---|
Unpenalized regression models | 101 (19.3) [16.1–23.1] |
Ordinary least squares regression | 27 (5.2) [3.5–7.5] |
Maximum likelihood logistic regression | 74 (14.2) [11.4–17.5] |
Penalized regression models | 29 (5.6) [3.8–8] |
Elastic Net | 9 (1.7) [0.8–3.4] |
LASSO | 13 (2.5) [1.4–4.3] |
Ridge | 7 (1.3) [0.6–2.9] |
Tree-based models | 166 (31.8) [28–36] |
Decision trees (for example, CART) | 46 (8.8) [6.6–11.7] |
Random forest | 73 (14) [11.2–17.3] |
Extremely randomized trees | 1 (0.2) [0.01–1.2] |
Regularized Greedy Forest | 1 (0.2) [0.01–1.2] |
Gradient boosting machine | 34 (6.5) [4.6–9.1] |
XGBoost | 11 (2.1) [1.1–3.9] |
Neural Network (incl. deep learning) | 75 (14.4) [11.5–17.7] |
Support Vector Machine | 86 (16.5) [13.5–20] |
Naïve Bayes | 22 (4.2) [2.7–6.4] |
K-nearest neighbor | 15 (2.9) [1.7–4.8] |
Superlearner ensembles | 14 (2.7) [1.5–4.6] |
Other | 10 (1.9) [1–3.6] |
Unclear | 4 (0.8) [0.2–2.1] |
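For orientation only, the sketch below fits three of the algorithm families tallied above (penalized logistic regression, random forest, and a support vector machine) on synthetic data with scikit-learn defaults; it does not reproduce the pipeline of any included study.

```python
# Sketch: three algorithm families from the table, fitted on synthetic data.
# Hyperparameters are library defaults, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "penalized logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "support vector machine": SVC(random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.2f}")
```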
3.1 Participants
Key items | Total (n = 152), n (%) [95% CI] | Development only (n = 133), n (%) [95% CI] | Development with external validation (n = 19), n (%) [95% CI] |
---|---|---|---|
Data sources | |||
Prospective cohort | 50 (32.9) [25.9–40.7] | 43 (32.3) [25–40.7] | 7 (36.8) [19.1–59] |
Retrospective cohort | 48 (31.6) [24.7–39.3] | 45 (33.8) [26.3–42.2] | 4 (21.1) [8.5–43.3] |
Randomized Controlled Trial | 3 (2.0) [0.7–5.6] | 2 (1.5) [0.4–5.3] | 1 (5.3) [0.3–24.6] |
EMR | 30 (19.7) [14.2–26.8] | 28 (21.1) [15–28.7] | 0 |
Registry | 18 (11.8) [7.6–17.9] | 15 (11.3) [7–17.8] | 4 (21.1) [8.5–43.3] |
Administrative claims | 4 (2.6) [1–6.6] | 4 (3.0) [1.2–7.5] | 0 |
Case-control | 18 (11.8) [7.6–17.9] | 15 (11.3) [7–17.8] | 3 (15.8) [5.5–37.6] |
Number of centers | 110 (72.4) | 98 (73.7) | 12 (63.2) |
Median [IQR] (range) | 1 [1–3], 1 to 51,920 | 1 [1–3], 1 to 712 | 1 [1–10], 1 to 51,920 |
Follow-up (mo) | 47 (30.9) | 39 (29.2) | 8 (42.1) |
Median [IQR] (range) | 41.9 [3–60], 0.3 to 307 | 43.6 [4.5–60], 0.3 to 307 | 33.5 [1.75–42], 1 to 144 |
Predictor horizon (mo) | 49 (32.2) [25.3–40] | 61 (45.9) [37.6–54.3] | 7 (36.8) |
Median [IQR] (range) | 8.5 [1–36], 0.03 to 120 | 6 [1–33.5], 0.03 to 120 | 36 [6.5–60], 1 to 60 |
Sample size justification | 27 (17.8) [12.5–24.6] | 24 (18.0) [12.4–25.4] | 3 (15.8) |
Power | 5 (18.5) [8.2–36.7] | 5 (20.8) [9.2–40.5] | 0 |
Justified time interval | 5 (18.5) [8.2–36.7] | 3 (12.5) [4.3–31] | 2 (66.7) |
Size of existing/available data | 16 (59.3) [40.7–75.5] | 15 (62.5) [42.7–78.8] | 1 (33.3) |
Events per variable | 1 (3.7) [0.2–18.3] | 1 (4.2) [0.2–20.2] | 0 |
Internal validation | |||
Split sample with test set | 86 (56.6) [48.6–64.2] | NA | NA |
(Random) split | 49 (57) [46.4–66.9] | ||
(Nonrandom) split | 9 (10.5) [5.6–18.7] | ||
Split | 28 (32.6) [23.6–43] | ||
Bootstrapping | 5 (3.3) [1.4–7.5] | NA | NA |
With test set | 3 (60.0) [23.1–88.2] | ||
With cross-validation | 1 (20) [1–62.4] | ||
Cross-validation | 70 (46.1) [38.3–54] | NA | NA |
Non-nested (single) | 32 (45.7) [34.6–57.3] | |
Nested | 10 (14.3) [7.9–24.3] | ||
With test set | 24 (34.3) [24.2–46] | ||
External validation | |||
Chronological | NA | NA | 5 (26.3) [11.8–48.8] |
Geographical | NA | NA | 3 (15.8) [5.5–37.6] |
Independent dataset | NA | NA | 11 (57.9) [36.3–76.9] |
Fully independent dataset | NA | NA | 8 (42.1) [23.1–63.7] |
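To make the internal validation categories above concrete, here is a brief sketch, on synthetic data, of a random split-sample approach and of non-nested versus nested cross-validation; it is illustrative only and does not reproduce any included study.

```python
# Sketch: internal validation strategies counted above, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1. Random split-sample: hold out a test set once.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)  # accuracy on the test set

# 2. Non-nested (single) cross-validation.
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# 3. Nested cross-validation: tuning in an inner loop, evaluation in an outer loop.
tuned = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10]}, cv=5, scoring="roc_auc")
nested_auc = cross_val_score(tuned, X, y, cv=5, scoring="roc_auc").mean()
print(split_acc, cv_auc, nested_auc)
```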
3.2 Data sources
3.3 Outcome
3.4 Candidate predictors
Key items | Total (n = 152), n (%) [95% CI] |
---|---|
Type of candidate predictors | |
Demography | 120 (78.9) [71.8–84.7] |
Clinical history | 111 (73.0) [65.5–79.4] |
Physical examination | 0 |
Blood or Urine parameters | 63 (41.4) [33.9–49.4] |
Imaging | 49 (32.2) [25.3–40] |
Genetic risk score | 7 (4.6) [2.2–9.2] |
Pathology | 16 (10.5) [6.6–16.4] |
Scale score | 31 (20.4) [14.8–27.5] |
Questionnaires | 0 |
Treatment as candidate predictor | |
Yes | 36 (23.7) [17.6–31] |
No | 80 (52.6) [44.7–60.4] |
Not applicable | 36 (23.7) [17.6–31] |
Continuous variables as candidate predictors | |
Yes | 131 (86.2) [79.8–90.8] |
Unclear | 17 (11.2) [7.1–17.2] |
A-priori selection of candidate predictors | |
Yes | 63 (41.4) [33.9–49.4] |
No | 47 (30.9) [24.1–38.7] |
Unclear | 42 (27.6) [21.1–35.2] |
Methods to handle continuous predictors | |
Linear (no change) | 13 (8.6) [5.1–14.1] |
Nonlinear (planned) | 2 (1.3) [0.4–4.7] |
Nonlinear (unplanned) | 4 (2.6) [1–6.6] |
Categorized (some) | 16 (10.5) [6.6–16.4] |
Categorized (all) | 18 (11.8) [7.6–17.9] |
Unclear | 104 (68.4) [60.7–75.3] |
Categorization of continuous predictors | |
Data dependent | 4 (2.6) [1–6.6] |
No rationale | 17 (11.2) [7.1–17.2] |
Based on previous literature or standardization | 13 (8.6) [5.1–14.1] |
Not reported | 118 (77.6) [70.4–83.5] |
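How continuous candidate predictors were handled was often unclear, and categorization was common. The sketch below (scikit-learn, with a hypothetical "age" predictor rather than data from any included study) contrasts leaving a predictor linear, modelling planned nonlinearity with splines, and categorizing it.

```python
# Sketch: three ways of handling a continuous predictor, as categorized above.
# "age" is a hypothetical predictor, not data from any included study.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, SplineTransformer

rng = np.random.default_rng(0)
age = rng.uniform(30, 80, size=(200, 1))

linear = age                                                            # linear (no change)
nonlinear = SplineTransformer(degree=3, n_knots=4).fit_transform(age)   # planned nonlinearity
categorized = KBinsDiscretizer(n_bins=4, encode="onehot-dense",
                               strategy="quantile").fit_transform(age)  # categorized
print(linear.shape, nonlinear.shape, categorized.shape)
```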
3.5 Sample size
Key items | Total (n = 152), n (%) | Median [IQR], range |
---|---|---|
Initial sample size | 93 (61.2) | 999 [272–24,522], 8 to 1,093,177 |
External validation | 13 (68.4) | 318 [90–682], 19 to 1,113,656 |
Final sample size | 151 (99.3) | 587 [172–6,328], 8 to 594,751 |
Model development | 83 (54.6) | 641 [226–10,512], 5 to 392,536 |
Internal validation | 83 (54.6) | 230 [75–2,892], 2 to 202,215 |
External validation | 18 (94.7) | 293 [71–1,688], 19 to 59,738 |
Initial number of events | 10 (6.6) | 66 [15–207], 15 to 4,370 |
External validation | 1 (5.3) | 107 |
Final number of events | 37 (24.3) | 106 [5–-364], 15 to 7,543 |
Model development | 19 (13.2) | 156 [47–353], 10 to 5,054 |
Internal validation | 19 (13.2) | 35 [26–109], 4 to 2,489 |
External validation | 4 (21.1) | 250 [121–990], 107 to 2,834 |
Number of candidate predictors | 119 (78.3) | 24 [13–112], 2 to 39,212 |
Number of included predictors | 90 (59.2) | 12 [7–23], 2 to 570 |
Events per candidate predictor | 28 (18.4) | 12.5 [5.7–27.7], 1.2 to 754.3 |
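Events per candidate predictor, reported by only 28 studies, is the number of outcome events divided by the number of candidate predictors considered. A trivial sketch of the calculation, with hypothetical counts rather than values from any included study:

```python
# Sketch: events per candidate predictor (EPV) for a binary outcome.
# The counts are hypothetical and only illustrate the arithmetic.
def events_per_predictor(n_events: int, n_candidate_predictors: int) -> float:
    """Outcome events divided by candidate predictors considered."""
    return n_events / n_candidate_predictors

print(events_per_predictor(n_events=150, n_candidate_predictors=12))  # 12.5
```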
3.6 Missing values
Key items | Total (n = 152), n (%) [95% CI] | Development only (n = 133), n (%) [95% CI] | Development with external validation (n = 19), n (%) [95% CI] |
---|---|---|---|
Missingness as exclusion criteria for participants | |||
Yes | 56 (36.8) [29.6–44.7] | 51 (38.3) [30.5–46.8] | 2 (10.5) [2.9–31.4] |
Unclear | 36 (23.7) [17.6–31] | 33 (24.8) [18.2–32.8] | 6 (31.6) [15.4–54] |
Number of patients excluded | 36 (23.7) [17.6–31] | 34 (25.6) [18.9–33.6] | 0 |
Median [IQR] (range) | 191 [19–4,209], (1 to 627,180) | 224 [16–4,699], (1 to 627,180) | 0 |
Methods of handling missing data | |||
No missing data | 4 (2.6) [1–6.6] | 3 (2.3) [0.8–6.4] | 1 (5.3) [0.3–24.6] |
No imputation | 4 (2.6) [1–6.6] | 4 (3) [1.2–7.5] | 0 |
Complete case-analysis | 30 (19.7) [14.2–26.8] | 28 (21.1) [15–28.7] | 2 (10.5) [2.9–31.4] |
Mean imputation | 4 (2.6) [1–6.6] | 3 (2.3) [0.8–6.4] | 1 (5.3) [0.3–24.6] |
Median imputation | 10 (6.6) [3.6–11.7] | 10 (7.5) [4.1–13.3] | 0 |
Multiple imputation | 6 (3.9) [1.8–8.3] | 6 (4.5) [2.1–9.5] | 0 |
K-nearest neighbor imputation | 5 (3.3) [1.4–7.5] | 5 (3.8) [1.6–8.5] | 0 |
Replacement with null value | 3 (2.0) [0.7–5.6] | 1 (0.8) [0–4.1] | 2 (10.5) [2.9–31.4] |
Last value carried forward | 4 (2.6) [1–6.6] | 4 (3) [1.2–7.5] | 0 |
Surrogate variable | 1 (0.7) [0–3.6] | 1 (0.8) [0–4.1] | 0 |
Random forest imputation | 4 (2.6) [1–6.6] | 3 (2.3) [0.8–6.4] | 1 (5.3) [0.3–24.6] |
Categorization | 3 (2) [0.7–5.6] | 2 (1.5) [0.4–5.3] | 1 (5.3) [0.3–24.6] |
Unclear | 6 (3.9) [1.8–8.3] | 5 (3.8) [1.6–8.5] | 1 (5.3) [0.3–24.6] |
Presentation of missing data | |||
Not summarized | 129 (84.9) [78.3–89.7] | 114 (85.7) [78.8–90.7] | 16 (84.2) [62.4–94.5] |
Overall | 6 (3.9) [1.8–8.3] | 4 (3) [1.2–7.5] | 2 (10.5) [2.9–31.4] |
By all final model variables | 3 (2) [0.7–5.6] | 3 (2.3) [0.8–6.4] | 0 |
By all candidate predictors | 13 (8.6) [5.1–14.1] | 11 (8.3) [4.7–14.2] | 1 (5.3) [0.3–24.6] |
By number of variables | 1 (0.7) [0–3.6] | 1 (0.8) [0–4.1] | 0 |
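Several of the missing-data strategies tallied above can be illustrated with scikit-learn on a toy matrix: complete-case analysis, mean imputation, k-nearest-neighbor imputation, and an iterative imputer as a rough stand-in for multiple imputation. A minimal sketch:

```python
# Sketch: missing-data strategies listed above, applied to a toy matrix.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

complete_case = X[~np.isnan(X).any(axis=1)]              # complete-case analysis
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
# IterativeImputer models each feature from the others; it yields a single
# completed dataset, so it only approximates full multiple imputation.
iterative = IterativeImputer(random_state=0).fit_transform(X)
print(complete_case.shape, mean_imputed.shape, knn_imputed.shape, iterative.shape)
```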
3.7 Class imbalance
Key items | Total (n = 152), n (%) [95% CI] |
---|---|
Data preparation | 58 (38.2) [30.8–46.1] |
Cleaning | 21 (36.2) [25.1–49.1] |
Aggregation | 6 (10.3) [4.8–20.8] |
Transformation | 6 (10.3) [4.8–20.8] |
Sampling | 2 (3.4) [1–11.7] |
Standardization/Scaling | 11 (19) [10.9–30.9] |
Normalization | 22 (37.9) [26.6–50.8] |
Integration | 0 |
Reduction | 12 (20.7) [12.3–32.8] |
Other | 9 (15.5) [8.4–26.9] |
Data splitting | 86 (56.6) [48.6–64.2] |
Train-test set | 77 (50.7) [42.8–58.5] |
Train-validation-test set | 9 (5.9) [3.1–10.9] |
Dimensionality reduction techniques | 9 (5.9) [3.1–10.9] |
CART | 1 (11.1) [0.6–43.5] |
Principal component analysis | 3 (33.3) [12.1–64.6] |
Factor analysis | 1 (11.1) [0.6–43.5] |
Image decomposition | 1 (11.1) [0.6–43.5] |
Class imbalance | 27 (17.8) [12.5–24.6] |
Random undersampling | 4 (14.8) [5.9–32.5] |
Random oversampling | 5 (18.5) [8.2–36.7] |
SMOTE | 11 (40.7) [24.5–59.3] |
RUSBoost | 1 (3.7) [0.2–18.3] |
Other | 7 (25.9) [13.2–44.7] |
Strategy for hyperparameter optimization | 44 (28.9) [22.3–36.6] |
Grid search (no further details) | 5 (3.3) [1.4–7.5] |
Cross-validated grid search | 14 (9.2) [5.6–14.9] |
Randomized grid search | 1 (0.7) [0–3.6] |
Cross-validation | 15 (9.9) [6.1–15.6] |
Manual search | 1 (0.7) [0–3.6] |
Predefined values/default | 3 (2) [0.7–5.6] |
Bayesian optimization | 2 (1.3) [0.4–4.7] |
Tree-structured Parzen estimator method | 1 (0.7) [0–3.6] |
Unclear | 4 (2.6) [1–6.6] |
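Two frequently reported choices above, correcting class imbalance with SMOTE and tuning hyperparameters by cross-validated grid search, are sketched below with imbalanced-learn and scikit-learn on synthetic data. This is illustrative only; note also the cited evidence that imbalance corrections can distort the calibration of predicted risks.

```python
# Sketch: SMOTE oversampling inside a cross-validated grid search.
# Synthetic, imbalanced binary-outcome data; illustrative only.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# The imblearn Pipeline applies SMOTE only to the training folds,
# so the validation folds keep their original class balance.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]},
                      cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```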
3.8 Modelling algorithms
3.9 Selection of predictors
Key items | Total (n = 522), n (%) [95% CI] |
---|---|
Selection of predictors | |
Stepwise | 8 (1.5) [0.7–3.1] |
Forward selection | 31 (5.9) [4.1–8.4] |
Backward selection | 5 (1) [0.4–2.4] |
All predictors | 72 (13.8) [11–17.1] |
All significant in univariable analysis | 27 (5.2) [3.5–7.5] |
Embedded in learning process | 192 (36.8) [32.7–41.1] |
Other | 19 (3.6) [2.3–5.7] |
Unclear | 168 (32.2) [28.2–36.4] |
Hyperparameter tuning reported | |
Yes | 160 (30.7) [26.7–34.8] |
No | 283 (54.2) [49.8–58.5] |
Not applicable/Unclear | 79 (15.1) [12.2–18.6] |
Variable importance reported | |
Mean decrease in accuracy | 26 (5) [3.3–7.3] |
Mean decrease in node impurity | 31 (5.9) [4.1–8.4] |
Weights/correlation | 10 (1.9) [1–3.6] |
Information gain | 24 (4.6) [3–6.9] |
Unclear method | 115 (22) [18.6–25.9] |
None | 316 (60.5) [56.2–64.7] |
Penalization methods used | |
None | 481 (92.1) [89.4–94.2] |
Uniform shrinkage | 3 (0.6) [0.1–1.8] |
Penalized estimation | 27 (5.2) [3.5–7.5] |
Other | 11 (2.1) [1.1–3.9] |
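Two of the approaches in the table can be made concrete: predictor selection embedded in the learning process (here an L1-penalized logistic regression, as in LASSO) and permutation importance, one way of obtaining a "mean decrease in accuracy". A scikit-learn sketch on synthetic data:

```python
# Sketch: embedded predictor selection (L1 penalty) and permutation importance.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Embedded selection: the L1 penalty shrinks some coefficients exactly to
# zero, so those predictors drop out of the final model.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_tr, y_tr)
n_selected = int((lasso.coef_ != 0).sum())

# Permutation importance: mean drop in a score when one predictor is shuffled.
imp = permutation_importance(lasso, X_te, y_te, scoring="roc_auc",
                             n_repeats=20, random_state=0)
print(n_selected, imp.importances_mean.round(3))
```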
3.10 Variable importance and hyperparameters
3.11 Performance metrics
Key items (all extracted models, n = 522) | DEV, n (%) [95% CI] | VAL, n (%) [95% CI] |
---|---|---|
Calibration | ||
Calibration plot | 23 (4.4) [2.9–6.6] | 1 (0.2) [0.01–1.2] |
Calibration slope | 17 (3.3) [2–5.3] | 1 (0.2) [0.01–1.2] |
Calibration intercept | 16 (3.1) [1.8–5] | 1 (0.2) [0.01–1.2] |
Calibration in the large | 1 (0.2) [0.01–1.2] | 0 |
Calibration table | 1 (0.2) [0.01–1.2] | 0 |
Kappa | 10 (1.9) [1–3.6] | 0 |
Observed/expected ratio | 1 (0.2) [0.01–1.2] | 0 |
Hosmer-Lemeshow statistic | 4 (0.8) [0.3–2.1] | 0
None | 494 (94.6) [92.3–96.3] | |
Discrimination | ||
AUC/AUC-ROC | 349 (66.9) [62.6–70.9] | 46 (8.8) [6.6–11.7] |
C-statistic | 9 (1.7) [0.8–3.4] | 0 |
None | 164 (31.4) [27.5–35.6] |
Classification | ||
NRI | 9 (1.7) [0.8–3.4] | 0 |
Sensitivity/Recall | 239 (45.8) [41.5–50.2] | 30 (5.7) [4–8.2] |
Specificity | 193 (37) [32.8–41.3] | 22 (4.2) [2.7–6.4] |
Decision-analytic | ||
Decision Curve Analysis | 2 (0.4) [0.01–1.5] | 0 |
IDI | 1 (0.2) [0.01–1.2] | 0 |
Overall | ||
R2 | 14 (2.7) [1.5–4.6] | 0 |
Brier score | 19 (3.6) [2.3–5.7] | 6 (1.1) [0.5–2.6] |
Predictive values | 160 (30.7) [26.8–34.8] | 10 (1.9) [1–3.6] |
AUC difference | 2 (0.4) [0.01–1.5] | 0 |
Accuracy | 234 (44.8) [40.5–49.2] | 26 (5) [3.4–7.3] |
F1-score | 79 (15.1) [12.2–18.6] | 0 |
Mean square error | 21 (4) [2.6–6.2] | 0 |
Misclassification rate | 9 (1.7) [0.8–3.4] | 0 |
Matthews correlation coefficient | 5 (1) [0.4–2.4] | 0
AUPR | 21 (4) [2.6–6.2] | 0 |
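Most of the discrimination, classification, and overall metrics counted above can be computed directly from predicted probabilities and a classification threshold. A small sketch with toy predictions (the 0.5 threshold is an assumption for illustration):

```python
# Sketch: common performance metrics from the table, on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, brier_score_loss,
                             confusion_matrix, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.7, 0.2, 0.4])
y_pred = (y_prob >= 0.5).astype(int)  # classification at an assumed 0.5 threshold

auc = roc_auc_score(y_true, y_prob)        # discrimination (AUC/AUC-ROC)
brier = brier_score_loss(y_true, y_prob)   # overall accuracy of the probabilities
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = accuracy_score(y_true, y_pred)
print(round(auc, 2), round(brier, 2), sensitivity, specificity, accuracy)
```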
3.12 Uncertainty quantification
3.13 Predictive performance
Key items (all extracted models, n = 522) | Reported, n (%) | Apparent performance, median [IQR], range | Reported, n (%) | Corrected performance, median [IQR], range | Reported, n (%) | Externally validated performance, median [IQR], range |
---|---|---|---|---|---|---|
Calibration | ||||||
Slope | 11 (1.9) | 1.05 [1.02–1.07], 0.53 to 1.46 | 15 (2.9) | 1.3 [1–4], 0.52 to 17.6 | 4 (0.8) | 9.9 [7.87–12.8], 5.7 to 17.6 |
Intercept | 10 (1.9) | 0.07 [0.05–0.12], −0.08 to 2.32 | 15 (2.9) | −0.01 [−1.85–0.15], −8.3 to 2.74 | 4 (0.8) | −4.5 [−5.7 to −3.8], −8.3 to −3 |
Calibration-in-the-large | 1 (0.2) | −0.008 | 0 | 0 | ||
Observed:expected ratio | 1 (0.2) | 0.993 | 4 (0.8) | 0.99 [0.98–1.01], 0.98 to 1.04 | 0 | |
Hosmer-Lemeshow | 2 (0.4) | Not significant | 0 | 0
Pearson chi-square | 1 (0.2) | Not significant | 0 | 0 | ||
Mean Calibration Error | 4 (0.8) | 0.81 [0.7–0.88], 0.51 to 0.99 | 0 | 0 | ||
Discrimination | ||||||
AUC | 249 (47.7) | 0.82 [0.74–0.90], 0.45 to 1.00 | 154 (29.5) | 0.82 [0.74–0.90], 0.46 to 0.99 | 46 (8.8) | 0.82 [0.73–0.98], 0.52 to 0.97 |
Accuracy | 128 (24.5) | 79.8 [72.6–89.8], 44.2 to 100 | 117 (22.4) | 81.4 [76–89.9], 17.8 to 97.5 | 9 (1.7) | 70 [64–87], 55 to 90 |
Sensitivity | 156 (29.9) | 74 [58.6–87.8], 0 to 100 | 103 (19.7) | 80 [66.3–89.7], 14.8 to 100 | 12 (2.3) | 77.5 [63.9–83.5], 0.7 to 91 |
Specificity | 122 (23.4) | 82.2 [73.3–89.7], 17 to 100 | 80 (15.3) | 83.2 [73.6–90.8], 46.6 to 100 | 10 (1.9) | 74.4 [64.8–86.7], 42 to 90.5 |
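The calibration slope and intercept summarized above are commonly estimated by regressing the observed binary outcome on the logit of the predicted probabilities (slope, ideal value 1) and, with the slope fixed at 1 via an offset, re-estimating the intercept (calibration-in-the-large, ideal value 0). A minimal statsmodels sketch under that assumption, with toy predictions:

```python
# Sketch: calibration slope and calibration-in-the-large from toy predictions,
# estimated by logistic regression on the linear predictor (logit of p-hat).
import numpy as np
import statsmodels.api as sm

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.7, 0.2, 0.4])
lp = np.log(y_prob / (1 - y_prob))  # linear predictor

# Calibration slope: coefficient of the linear predictor (ideal value 1).
slope_fit = sm.GLM(y_true, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit()
slope = slope_fit.params[1]

# Calibration-in-the-large: intercept with the slope fixed at 1 (ideal value 0).
citl_fit = sm.GLM(y_true, np.ones((len(lp), 1)), offset=lp,
                  family=sm.families.Binomial()).fit()
citl = citl_fit.params[0]
print(round(float(slope), 2), round(float(citl), 2))
```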
3.14 Internal validation
3.15 External validation
3.16 Model availability
4. Discussion
4.1 Principal findings
4.2 Comparison to previous studies
4.3 Strengths and limitations of this study
4.4 Implication for researchers, editorial offices, and future research
5. Conclusions
Acknowledgments
Appendix A. Supplementary Data
- Supplemental File 1
- Supplemental File 2
References
- Prognosis and prognostic research: what, why, and how?.BMJ. 2009; 338: 1317-1320
- Clinical prediction models: diagnosis versus prognosis.J Clin Epidemiol. 2021; 132: 142-145
- A short guide for medical professionals in the era of artificial intelligence.NPJ Digit Med. 2020; 3: 126
- Comparing different supervised machine learning algorithms for disease prediction.BMC Med Inform Decis Mak. 2019; 19: 281
- Biomedical research: increasing value, reducing waste.Lancet. 2014; 383: 101-104
- Appraising prediction research: a guide and meta-review on bias and applicability assessment using the Prediction model Risk of Bias ASsessment Tool (PROBAST).Nephrology. 2021; 26: 939-947
- Prediction models for cardiovascular disease risk in the general population: systematic review.BMJ. 2016; 353: i2416
- Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting.BMC Med. 2011; 9: 103
- Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review.BMJ. 2021; 375: n2281
- Predictive models for hospital readmission risk: a systematic review of methods.Comput Methods Programs Biomed. 2018; 164: 49-64
- A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases.NPJ Digit Med. 2020; 3: 30
- Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review.BMC Med Res Methodol. 2022; 22: 1-16
- The PRISMA 2020 statement: an updated guideline for reporting systematic reviews.BMJ. 2021; 372: n71
- Protocol for a systematic review on the methodological and reporting quality of prediction model studies using machine learning techniques.BMJ Open. 2020; 10: 1-6
- Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist.PLoS Med. 2014; 11: e1001744
- Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration.Ann Intern Med. 2015; 162: W1-W73
- Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement.Ann Intern Med. 2015; 162: 55
- Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view.J Med Internet Res. 2016; 18: e323
- SMOTE: synthetic minority over-sampling technique.J Artif Intell Res. 2002; 16: 321-357
- RUSBoost: a hybrid approach to alleviating class imbalance.IEEE Trans Syst Man Cybern Syst Hum. 2010; 40: 185-197
- All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously.J Mach Learn Res. 2019; 20: 1-81
- Reporting and interpreting decision curve analysis: a guide for investigators.Eur Urol. 2018; 74: 796-804
- Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review.BMC Med Res Methodol. 2022; 22: 12
- Big data and machine learning in health care.JAMA. 2018; 319: 1317-1318
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer-Verlag, New York; 2009
- Random forests.Mach Learn. 2001; 45: 5-32
- Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Massachusetts; 2001
- Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints.BMC Med Res Methodol. 2014; 14: 137
- Internal validation of predictive models: efficiency of some procedures for logistic regression analysis.J Clin Epidemiol. 2001; 54: 774-781
- Prediction models need appropriate internal, internal-external, and external validation.J Clin Epidemiol. 2016; 69: 245-247
- Calibration: the Achilles heel of predictive analytics.BMC Med. 2019; 17: 1-7
- The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression.J Am Med Inform Assoc. 2022; 29: 1525-1534
- Treatment use in prognostic model research: a systematic review of cardiovascular prognostic studies.Diagn Progn Res. 2017; 1: 1-10
- Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review.J Clin Epidemiol. 2022; 142: 218-229
- Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved.J Clin Epidemiol. 2021; 138: 60-72
- Transparent reporting of multivariable prediction models in journal and conference abstracts: TRIPOD for abstracts.Ann Intern Med. 2022; 173: 42-48
- External validation of multivariable prediction models: a systematic review of methodological conduct and reporting.BMC Med Res Methodol. 2014; 14: 40
- Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal.BMJ. 2020; 369: m1328
- Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement.BMC Med. 2018; 16: 1-12
- Reporting and methods in clinical prediction research: a systematic review.PLoS Med. 2012; 9: 1-12
- Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review.J Am Med Inform Assoc. 2022; 29: 983-989
- TRIPOD statement: a preliminary pre-post analysis of reporting and methods of prediction models.BMJ Open. 2020; 10: e041537
- Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging.PLoS One. 2020; 15: 1-10
- Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence.BMJ Open. 2021; 11: e048008
- Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness.BMJ. 2020; 368: l6927
- PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.Ann Intern Med. 2019; 170: 51-58
- PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration.Ann Intern Med. 2019; 170: W1-W33
- Empirical evidence of the impact of study characteristics on the performance of prediction models: a meta-epidemiological study.BMJ Open. 2019; 9: 1-12
- Reporting of artificial intelligence prediction models.Lancet. 2019; 393: 1577-1579
- What do we mean by validating a prognostic model?.Stat Med. 2000; 19: 453-473
- Predictive analytics in health care: how can we know it works?.J Am Med Inform Assoc. 2019; 26: 1651-1654
Article info
Footnotes
Funding: GSC is funded by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC) and by Cancer Research UK program grant (C49297/A27294). PD is funded by the NIHR Oxford BRC. RB is affiliated to the National Institute for Health and Care Research (NIHR) Applied Research Collaboration (ARC) West Midlands. The views expressed are those of the authors and not necessarily those of the NHS, NIHR, or Department of Health and Social Care. None of the funding sources had a role in the design, conduct, analyses, or reporting of the study or in the decision to submit the manuscript for publication.
Registration and protocol: This review was registered in PROSPERO (CRD42019161764). The study protocol can be accessed at https://doi.org/10.1136/bmjopen-2020-038832.
Competing interests: There are no conflicts of interest to declare.
Availability of data, code, and other materials: Articles that support our findings are publicly available. Template data collection forms, detailed data extraction on all included studies, and analytical code are available upon reasonable request.
Ethical approval: Not required for this work.
Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author Contributions: Constanza L. Andaur Navarro: Conceptualization, Methodology, Investigation, Data Curation, Formal analysis, Writing - original draft, Writing - review & editing; Johanna A.A. Damen: Conceptualization, Methodology, Investigation, Writing - review & editing, Supervision; Maarten van Smeden: Conceptualization, Writing - review & editing; Toshihiko Takada: Investigation, Writing - review & editing; Steven WJ Nijman: Investigation, Writing - review & editing; Paula Dhiman: Conceptualization, Methodology, Investigation, Writing - review & editing; Jie Ma: Investigation, Writing - review & editing; Gary S Collins: Conceptualization, Methodology, Writing - review & editing; Ram Bajpai: Investigation, Writing - review & editing; Richard D Riley: Conceptualization, Methodology, Writing - review & editing; Karel GM Moons: Conceptualization, Methodology, Writing - review & editing, Supervision; Lotty Hooft: Conceptualization, Methodology, Writing - review & editing, Supervision.