Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review

Objectives: Missing data is a common problem during the development, evaluation, and implementation of prediction models. Although machine learning (ML) methods are often said to be capable of circumventing missing data, it is unclear how these methods are used in medical research. We aim to find out if and how well prediction model studies using machine learning report on their handling of missing data. Study design and setting: We systematically searched the literature for papers published between 2018 and 2019 describing primary studies that developed and/or validated clinical prediction models using any supervised ML methodology, across all medical fields. From the retrieved studies, information about the amount and nature (e.g., missing completely at random, potential reasons for missingness) of missing data and the way they were handled was extracted. Results: We identified 152 machine learning-based clinical prediction model studies. A substantial number of these 152 papers did not report anything on missing data (n = 56/152). A majority (n = 96/152) reported details on the handling of missing data (e.g., methods used), though many of these (n = 46/96) did not report the amount of missingness in the data. In these 96 papers, the authors only sometimes reported possible reasons for missingness (n = 7/96) and information about missing data mechanisms (n = 8/96). The most common approach for handling missing data was deletion (n = 65/96), mostly via complete-case analysis (CCA) (n = 43/96). Very few studies used multiple imputation (n = 8/96) or built-in mechanisms such as surrogate splits (n = 7/96) that directly address missing data during the development, validation, or implementation of the prediction model.
Conclusion: Though missing values are highly common in all types of medical research, and certainly in research based on routine healthcare data, a majority of the prediction model studies using machine learning do not report sufficient information on the presence and handling of missing data. Strategies in which patient data are simply omitted are unfortunately the most frequently used, even though this is generally advised against and is well known to cause bias and loss of analytical power, both in prediction model development and in the estimates of predictive accuracy. Prediction model researchers should be much more aware of alternative methodologies for addressing missing data.


Introduction
Careful consideration of the handling and reporting of missing data is an integral part of any research addressing and using clinical data, including clinical prediction model research [1][2][3][4][5][6] . Clinical prediction models use multiple input variables (i.e., covariates, predictors) to calculate the absolute risk of the presence (diagnostic models) or incidence (prognostic models) of a specific outcome. In the medical literature, most diagnostic and prognostic prediction models are derived or validated using regression modelling strategies. When missing values are present in the model development or validation sample, additional preparatory steps are required before model development.
The most common approach is to adopt a complete-case analysis (CCA), wherein individuals with missing data on any of the predictor or outcome variables are (automatically) deleted from the analysis [ 7 , 8 ]. Although this strategy is valid only under very stringent circumstances, it is generally inefficient and can lead to severe bias in the estimated model parameters (e.g., regression coefficients) and thus in the model's predictive performance [ 3 , 9 , 10 ]. For example, removing incomplete cases could lead to the loss of a substantial number of informative observations.
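To make concrete what complete-case deletion does, the following minimal sketch (illustrative only, not code from any reviewed study) drops every patient with at least one missing predictor value, as happens by default in many analysis pipelines:

```python
import numpy as np

def complete_case(X):
    """Keep only rows with no missing (NaN) values, as in a complete-case analysis."""
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], mask

# Hypothetical development data: 5 patients, 3 predictors.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],   # dropped: predictor 2 missing
    [7.0, 8.0, 9.0],
    [np.nan, 1.0, 2.0],   # dropped: predictor 1 missing
    [3.0, 4.0, 5.0],
])

X_cc, kept = complete_case(X)
# Only 3 of the 5 patients remain in the analysis, even though each dropped
# row is missing just one of its three values.
```

Note that the share of discarded patients grows quickly with the number of predictors, which is one reason CCA is considered inefficient.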
For this reason, it is generally recommended to implement multivariable imputation models that generate multiple imputations conditional on other (observed) patient characteristics [9][10][11][12][13] . When multiple imputation is used during prediction model development, multiple completed versions of the incomplete dataset are generated, and the prediction model coefficients are estimated separately in each. The model coefficients from the imputed datasets are then pooled using Rubin's rules and subsequently used for calculating absolute risk probabilities in new patients [ 10 , 11 ]. Although multiple imputation strategies are typically applied to an entire prediction model development or validation dataset, it is possible to generate imputations tailored to individual patients [ 14 , 15 ]. This also makes it possible to use multiple imputation techniques when actually implementing and applying prediction models in electronic healthcare software in daily clinical practice [13][14][15][16] .
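The pooling step via Rubin's rules can be sketched as follows; this is a generic illustration (the function name and toy numbers are ours, not from any reviewed study). The pooled point estimate is the mean of the per-imputation estimates, and the total variance combines within- and between-imputation variance:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool coefficient estimates from m imputed datasets via Rubin's rules.

    estimates, variances: arrays of shape (m, p), one row per imputed dataset,
    one column per model coefficient. Returns pooled estimates and total variances.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]
    q_bar = estimates.mean(axis=0)           # pooled point estimate
    w_bar = variances.mean(axis=0)           # within-imputation variance
    b = estimates.var(axis=0, ddof=1)        # between-imputation variance
    t = w_bar + (1 + 1 / m) * b              # total variance
    return q_bar, t

# Hypothetical coefficient for one predictor, estimated in m = 3 imputed datasets.
est = [[0.50], [0.54], [0.46]]
var = [[0.04], [0.05], [0.04]]
q, t = pool_rubin(est, var)
```

The pooled coefficients `q` are then used to calculate absolute risks for new patients, exactly as with a model fitted on a single complete dataset.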
Yet another approach is to address missing data directly during prediction model development, validation, or application. This strategy can, for instance, be achieved by including missing indicator variables, or by adopting pattern-mixture models, tree-based ensembles, or other machine learning (ML) methods that circumvent the use of missing data imputation ( Box 1 ) [17][18][19][20][21][22] .
Existing prediction model reporting guidelines (TRIPOD), congruent with the increasing amount of supportive literature, recommend reporting at least whether prediction model development and validation sets indeed contained missing data and to what extent, and how such missing data were addressed in the analysis [ 1 , 2 , 10 , 23 , 24 ]. So far, adherence to these reporting guidelines seems to be limited in applied prediction research. Even in prediction model studies that adopt more traditional (regression-based) methods, many reviews have found that missing data are often inadequately handled or completely ignored [25][26][27][28][29][30] .
With the emergence of ML methods for prediction modeling, which may circumvent the need for imputation (e.g., random forests with surrogate splits), it becomes less evident whether and how missing data are handled during model development or validation. The question remains how often researchers adopting these ML methods make use of alternative and proper strategies, and in what way. The objective of this study is, therefore, to investigate how well prediction model studies that used ML-based techniques reported on the presence, nature, and extent of missing data in the data sets used, and which methods were commonly used for handling missing data during prediction model development, validation, or (if done) implementation.

Methods
In a recent review by Andaur Navarro et al., we systematically searched the medical literature for primary studies developing and/or validating prediction models using any supervised ML methodology, published between January 2018 and December 2019 [ 31 , 32 ]. The protocol of that review was registered and published (PROSPERO, CRD42019161764) [33] . The search initially yielded 24,814 results, from which 10 random sets of 249 articles were sampled. From the sampled 2,482 publications, 152 were included in the review. The present review uses the same data set as that review ( Fig. 1 ). Similarly, for the present review, articles were eligible for inclusion when a primary study described the development or validation of a multivariable prediction model using any kind of supervised ML methodology. We defined a study using supervised ML as one using algorithmic approaches to develop or validate a prediction model (e.g., any tree-based methods, neural networks, or support vector machines). We excluded studies that adopted common statistical techniques such as linear regression, logistic regression, lasso regression, ridge regression, or elastic net. Studies were also excluded when only a single variable was studied. All human medical fields, with the notable exception of medical imaging, were included. To address the aim of the present review, we first defined a list of key reporting items that may facilitate the interpretation of prediction model studies in the presence of missing data ( Table 1 ).
These items were based on prevailing reporting guidelines [ 1-3 , 10 ] and consider: 1) information on the presence, amount, and distribution of missingness in the study variables, including reasons for the missing data and assumptions about the missing data mechanism; 2) methods for missing data handling, including the type (e.g., imputation, missing indicator, surrogate splits); and 3) implementation details of the missing data method, including the total number of imputed datasets and any (auxiliary, i.e., not part of the prediction model) variables used in the imputation models ( Table 1 ). Existing machine learning reporting guidelines sparsely refer to the need to report missing data details [34] . As a consequence, items specifically about the ML modeling techniques were based on key characteristics of known ML methods with built-in strategies to handle missing data [17][18][19][20] . Subsequently, we reviewed each eligible study and assessed whether missing data were present. For studies that reported the presence of missing data, we evaluated the level of reporting of the items listed in Table 1 . If applicable, data extraction was done for both the prediction model development and validation. When a sensitivity analysis was performed, the methods applied for handling missing data in these sensitivity analyses were also assessed separately. Supplementary material was considered when available. Ten percent of the total set was first reviewed by two reviewers (S.N., A.L.), who discussed the discrepancies found to resolve disagreements and calibrate their assessments. The two reviewers then each independently reviewed fifty percent of all studies. Unresolved disagreements were settled through consensus with a third reviewer (T.D.). All items used in the data extraction can be found in the Appendix.
For the data extraction, some reporting items (e.g., Item 2.1) about identifying and handling missing data from Table 1 were split into several separate data extraction items.

Results
After screening, 152 eligible articles were available for the present study ( Fig. 1 ). A total of 56 (37%) prediction model studies did not report on missing data and could not be analyzed further. We included 96 (63%) studies that reported on the handling of missing data. Across these 96 studies, 46 (48%) did not include information on the amount or nature of the missing data.

Presence and mechanism of missing data
Papers that reported on the amount of missing data most often (n = 31/50 [62%]) reported the overall number or frequency of missingness (e.g., the total number of patients or variables with one or more missing values). For these papers, the overall median percentage of missingness was 4.7% (IQR 1.85-28). In most other cases, it was unclear how many values were missing. It was often unclear exactly which variables were missing (n = 39/50 [78%]). In seven papers (n = 7/50 [14%]), it was explicitly stated that the outcome was missing. Only a small proportion of papers provided possible reasons for missingness of predictor values (n = 7/50 [14%]) or compared the characteristics of patients with and without any missing values (n = 5/50 [10%]). Additionally, a statement about the (potential) mechanism by which the data were missing was seldom reported (n = 8/50 [16%]).

Handling of missing data
Of the 96 papers reporting on missing data handling, the most common approach was deletion (n = 65/96 [68%]), with the majority using complete-case analysis (CCA) (n = 43/65 [66%]). About a third of the papers reporting on missing data handling used imputation (n = 36/96 [38%]), most often single imputation (n = 23/36 [61%]), typically with the mean (n = 12/23 [52%]). Only a handful used the recommended multiple imputation (n = 8/36 [22%]). Of these 8 papers, important details such as the number of imputed datasets, whether predictor and outcome variables were included in the imputation models, the exact imputation method applied, or whether auxiliary variables were used, were only rarely reported (1-3 papers). Missing indicators were used by some authors (n = 8/96 [8%]), most often in combination with a deletion or imputation method. In many studies (n = 23/96 [24%]), a combination of missing data handling methods was used, most often combining deletion practices with imputation methods (n = 15/23 [65%]). Only occasionally were these reported as sensitivity analyses (n = 3/23 [13%]). There were no studies in which a submodel approach was used.
A complete overview of the extracted data can be found in the Appendix.

Discussion
This work comprised a comprehensive review of 152 ML-based clinical prediction model development or validation studies, to evaluate the reporting and methodological quality with regard to the presence, amount, and handling of missing data in such studies. Consistent with similar reviews on the reporting of prediction models or missing data, the quality of reporting on missing data in ML-based prediction model studies was generally poor. This makes judging the validity of the reported prediction models or their predictive accuracy difficult or even impossible [ 25 , 35 ]. Examples of common pitfalls in the handling of missing data largely match those of similar reviews that analyzed studies reporting on prevailing statistical models: the exclusion of study participants with any missing data, and a lack of primary details on the amount or nature of the missing data and on the imputation methods used, if any ( Fig. 2 ).
Methods such as CCA and single imputation, often via mean imputation (52%), were highly common in the ML studies included in this review. It can seem efficient to apply methods such as mean imputation or CCA, but these ad-hoc methods are generally considered unfit for working with healthcare data [ 7 , 11 , 13 , 36 ]. Mean imputation and CCA can provide unbiased estimates only under stringent circumstances that healthcare data, and certainly routine healthcare data, usually do not satisfy. Similarly, there are strong recommendations to avoid the use of missing indicators, for example because they may alter the way clinicians approach the use of a predictive model, given that the model suggests missing data may themselves be informative [ 7 , 22 , 36 , 37 ]. Likewise, missing indicators require continued monitoring and dynamic revision for the various missing data circumstances in which they may be used, which is highly convoluted when applied in a medical decision-making context [38] . Surprisingly, this method is often used by studies using a non-imputation-based approach (53%). This tendency, in combination with the frequent absence of explicit motivations for choosing certain missing data handling strategies and the sparse reference to missing data in existing machine learning reporting guidelines, illustrates an overall lack of appreciation of the severe consequences of improper handling of missing data in prediction model studies, and also in clinical decision making based on prediction models.
Overall, there is clearly room for improvement in the strategies for handling missing values in prediction model studies adopting state-of-the-art ML methods. Although multiple imputation is currently considered the gold standard, it was only rarely implemented in these published studies (8/152 [5%]). In addition, several alternative strategies (e.g., pattern-mixture models, surrogate splits) are available that circumvent the need for imputation. These strategies may be particularly appealing for enhancing the development, validation, and implementation of prediction models, as they offer a unified approach to generating predictions in the presence of missing data. Still, among these approaches, it is as yet unclear which is to be preferred, and consensus about their effectiveness compared with more classical missing data handling methods is lacking; more research on this is warranted [ 18 , 19 , 39 ].
The level of reporting is arguably just as important as the quality of an imputation model. Providing sufficient detail to enable replication of the study is a key obligation of scientific research and reporting. Almost all studies that used multiple imputation lacked sufficient detail on which variables were included, the conditional imputation models used, and the number of imputed datasets. Also, the limited use of sensitivity analyses suggests that authors gave little consideration to the potential consequences of how missing data were handled. Further, the lack of detail on which variables were included in the imputation model suggests that known extensions that can improve the accuracy of the imputation model (e.g., the use of auxiliary variables) go unexploited [ 15 , 40 ]. To promote good missing data handling practice, we echo previous recommendations for sufficient reporting on missing data and any applied missing data handling method, to allow others to interpret the quality of the results, to allow for their replication, and to enhance the application of the prediction model [ 10 , 25 , 26 ]. Furthermore, journals are encouraged to ask for these details to be published in the original text or as supplementary files.
Many included papers used prediction models based on decision trees or random forests, for which built-in capabilities exist for handling missing data during development, validation, and implementation [ 17 , 18 ]. Most authors, however, did not clarify whether and how these were used. It is possible that many authors used the default way of handling missing data as programmed for these models, i.e., usually CCA. However, due to the limited inclusion of programming details (i.e., code, libraries, and packages), it remains largely uncertain how often these methods were used. The implementation of automated or built-in missing data handling methods is rare in software packages, which may explain their underreported use. Another possibility is that these built-in methods are taken for granted, which again suggests that there may be an overall lack of knowledge about the consequences of improper missing data handling. There is generally no consensus on how well these built-in methods work with regard to clinical prediction model development, validation, or implementation, which warrants additional research and caution when using them in the presence of missing data [ 18 , 19 , 39 ].
A limitation of our review may be the restricted search strategy of the original review: only articles published in PubMed over a time span of two years (between January 2018 and December 2019) were considered, and only a subsample (n = 2,482) of the initial search results (n = 24,814) was screened [33] . However, we believe that even with these restrictions the final study sample remains representative of the current status of the field, since no reporting or methods guidelines have since been issued that would likely have caused any improvements.
To our knowledge, this is the first comprehensive review evaluating the level of reporting and handling of missing data in ML-based clinical prediction model studies. We believe this review of a representative sample of prediction model development and validation studies in healthcare has highlighted severe issues with the general conduct and reporting of missing data in ML-based prediction model studies. It is well known that inappropriate handling of missing data can greatly reduce the validity and generalizability of predictions and the corresponding estimates of prediction model performance [ 1 , 5 ]. An improved understanding of the negative consequences of inappropriate handling of missing data, and of effective ways to remedy these issues through improved conduct and reporting, is warranted. We recommend that authors take note of and follow the existing reporting guidelines (notably, TRIPOD and STROBE) when publishing ML-based prediction model studies. These guidelines include a minimal set of reporting items that help to improve the interpretation and reproducibility of research findings.

Box 1 Prediction with built-in missing data handling
Missing indicator. For each variable in the model, a dichotomous dummy variable (0/1) is added to indicate whether that variable is missing or not [ 7 , 22 , 36 , 41 ]. These dummy variables are then included in the statistical (i.e., risk prediction) model as separate predictors. The original predictor variable, where missing, is usually set to 0. Missing indicators may contain relevant information for predictions, but are susceptible to so-called feedback loops: as soon as a clinician is aware of the informative missingness of certain predictors, their predictive value changes [ 37 , 38 , 42 ]. Additionally, other issues may arise in the application of missing indicators, as the manner of data collection is likely to vary between practices [38] .
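The construction above can be sketched in a few lines; this is a generic illustration (function name ours), not a specific library's implementation:

```python
import numpy as np

def add_missing_indicators(X):
    """Append one 0/1 indicator column per predictor and zero-fill the missing values."""
    X = np.asarray(X, dtype=float)
    indicators = np.isnan(X).astype(float)    # 1 where the original value was missing
    X_filled = np.where(np.isnan(X), 0.0, X)  # original missing entries set to 0
    return np.hstack([X_filled, indicators])

# Two hypothetical patients, two predictors, one missing value each.
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
Z = add_missing_indicators(X)
# Z has 4 columns: the zero-filled predictors followed by their missingness indicators.
```

The resulting matrix `Z` is then passed to the risk prediction model, so the indicators enter the model as ordinary predictors.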
Surrogate splits. Surrogate splits preserve the partitioning of each original split as well as possible in the presence of missing predictor values [18][19][20] . Whenever the model encounters a missing predictor value, it uses the surrogate variable (rather than the missing predictor variable) to decide on the split direction.
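The routing logic at a single tree node can be sketched as follows. This is a simplified illustration of the idea (variable names and the single-surrogate fallback are ours); real implementations such as CART typically rank several surrogates per node:

```python
def route(x, primary, surrogate):
    """Decide the split direction ('left'/'right') for one tree node.

    primary and surrogate are (feature_index, threshold) pairs; the surrogate
    split is assumed to have been chosen during training to best mimic the
    primary split. Feature values are floats or None (missing).
    """
    idx, thr = primary
    if x[idx] is not None:
        return "left" if x[idx] <= thr else "right"
    # Primary split variable missing: fall back to the surrogate variable.
    s_idx, s_thr = surrogate
    if x[s_idx] is not None:
        return "left" if x[s_idx] <= s_thr else "right"
    return "left"  # final fallback, e.g. the majority direction at this node

# Patient with a missing value for the primary split variable (index 0):
direction = route([None, 2.5], primary=(0, 1.0), surrogate=(1, 3.0))
```

Because the surrogate was trained to agree with the primary split, the patient is usually sent to the same side it would have gone to had the primary value been observed.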
Sparsity-aware splitting. A default direction is added for each tree node in a decision tree (e.g., XGBoost) [17] . Whenever a missing predictor value is encountered, the instance is classified into the prespecified default direction. The optimal default direction, and thus the best direction for handling missing data, is learnt from the data.
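How the default direction is learnt can be sketched as follows. This is a didactic simplification (all names and the toy impurity function are ours, not the actual XGBoost implementation, which scores candidate splits by gradient statistics): at each node, instances with missing values are tentatively sent left, then right, and the direction with the lower training loss is kept:

```python
def best_default_direction(values, labels, threshold, loss):
    """Pick the default direction for missing values at one tree node.

    Tries sending all instances with a missing value (None) left, then right,
    keeping whichever assignment yields the lower loss. `loss` is any callable
    scoring a (left_labels, right_labels) partition.
    """
    best = None
    for default in ("left", "right"):
        left, right = [], []
        for v, y in zip(values, labels):
            if v is None:
                side = default
            else:
                side = "left" if v <= threshold else "right"
            (left if side == "left" else right).append(y)
        score = loss(left, right)
        if best is None or score < best[1]:
            best = (default, score)
    return best[0]

# Toy loss: number of minority labels within each side (lower = purer split).
def impurity(left, right):
    def side(ys):
        return min(ys.count(0), ys.count(1)) if ys else 0
    return side(left) + side(right)

# The instance with the missing value behaves like the low (left) group here,
# so "left" is learnt as the default direction for this node.
d = best_default_direction([0.2, 0.4, None, 2.0, 2.5], [0, 0, 0, 1, 1], 1.0, impurity)
```

At prediction time, any patient with a missing value for this node's split variable is simply routed into the learnt default direction.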
Pattern-mixture models. For each pattern of missing data, a separate risk prediction model is fitted and included in the pattern-mixture model [21] . When applied to a new (out-of-sample) individual, the prediction model corresponding to (i.e., matching) that individual's missing data pattern is used.
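The submodel-per-pattern idea can be sketched as follows; this is a bare-bones illustration (class and function names are ours, not a specific library API), with a trivial mean-outcome "model" standing in for any real learner fitted on the observed columns only:

```python
from itertools import compress

def pattern_key(x):
    """Missingness pattern of one record: tuple of booleans, True = observed."""
    return tuple(v is not None for v in x)

class PatternMixturePredictor:
    """Fit one submodel per observed missingness pattern; dispatch at prediction time."""

    def __init__(self, fit_model):
        self.fit_model = fit_model  # callable training a predictor on complete columns
        self.models = {}

    def fit(self, X, y):
        groups = {}
        for x, label in zip(X, y):
            groups.setdefault(pattern_key(x), []).append((x, label))
        for key, rows in groups.items():
            Xs = [list(compress(x, key)) for x, _ in rows]  # observed columns only
            ys = [label for _, label in rows]
            self.models[key] = self.fit_model(Xs, ys)
        return self

    def predict(self, x):
        key = pattern_key(x)
        model = self.models[key]  # submodel matching this patient's pattern
        return model(list(compress(x, key)))

# Toy "model": predict the mean outcome of the training rows for that pattern.
def mean_model(Xs, ys):
    mu = sum(ys) / len(ys)
    return lambda x: mu

pm = PatternMixturePredictor(mean_model).fit(
    [[1.0, 2.0], [2.0, None], [4.0, None]], [1.0, 2.0, 4.0])
pred = pm.predict([3.0, None])  # dispatched to the submodel for this pattern
```

Note the practical caveat implied by the text: a new patient's missingness pattern must have occurred in the development data, otherwise no matching submodel exists.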

Data availability statement
The data that support the findings of this study are available upon reasonable request.