KEY CONCEPTS IN CLINICAL EPIDEMIOLOGY
Handling missing data in clinical research

Because missing data are present in almost every study, it is important to handle them properly. First of all, the missing data mechanism should be considered. Missing data can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). When missing data are MCAR, a complete case analysis can be valid. When missing data are MAR, a complete case analysis also leads to valid results in some situations. However, in most situations, missing data imputation should be used. Regarding imputation methods, multiple imputation is highly advised because it leads to valid estimates that include the uncertainty about the imputed values. When missing data are MNAR, even multiple imputation does not lead to valid results. A complication is that it is not possible to distinguish whether missing data are MAR or MNAR. Finally, it should be realized that preventing missing data is always better than treating missing data. © 2022 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Missing data mechanisms
Although researchers try to avoid missing data, these are present in almost every study. Ignoring missing data in statistical analysis can generate severely biased study results [1]. Rubin [2] was the first to develop a framework of different types of missing data (missing data mechanisms) that are important to determine the next steps in missing data handling. The three missing data mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that missing values are randomly distributed over the data sample: the reason for missing data is not related to relevant study variables or outcomes. For example, suppose a study in which people with familial hypertension are invited to come to the research center, where blood pressure and several covariates are measured to investigate which covariates are related to blood pressure in this particular population. When data on blood pressure are missing because some people were not able to visit the research center due to, for instance, a strike in public transport, these missing data are MCAR. MAR means that the probability of missing data is related to other, observed variables. For example, when more data on blood pressure are missing for people with high body mass index, these missing data are MAR. MNAR means that the probability of missing data depends on the values of the variable itself. This is the case when people with the highest values for blood pressure do not visit the research center. This latter situation is problematic because you never know whether this is the case or not. When missing data are MNAR, there is no easy method to produce valid results. One possibility is to conduct several sensitivity analyses to study the influence of missing data on study outcomes [3,4].
It should be realized that the missing data mechanism is variable-dependent, that is, in one study, missing data on some of the variables can be MCAR, whereas for other variables missing data can be MAR or MNAR. Regarding the missing data mechanisms, it does not matter whether the particular variable with missing data is the outcome variable of the study or one of the covariates.
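The three mechanisms can be made concrete with a small simulation of the blood pressure example. The sketch below is illustrative only: the function name `simulate`, the effect sizes, and the logistic missingness models are assumptions made for this example and do not come from any study. It compares the mean blood pressure of everyone with the mean among people whose blood pressure was observed.

```python
import math
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (mean of all values, mean of observed values) for blood pressure."""
    rng = random.Random(seed)
    everyone, observed = [], []
    for _ in range(n):
        bmi = rng.gauss(27, 4)                           # fully observed covariate
        bp = 120 + 1.5 * (bmi - 27) + rng.gauss(0, 10)   # variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                              # unrelated to any study variable
        elif mechanism == "MAR":
            # depends only on the observed covariate (body mass index)
            p_missing = 1 / (1 + math.exp(-(bmi - 27) / 2))
        else:  # "MNAR"
            # depends on the (possibly unobserved) blood pressure value itself
            p_missing = 1 / (1 + math.exp(-(bp - 120) / 5))
        everyone.append(bp)
        if rng.random() >= p_missing:
            observed.append(bp)
    return statistics.fmean(everyone), statistics.fmean(observed)

for mechanism in ("MCAR", "MAR", "MNAR"):
    full_mean, observed_mean = simulate(mechanism)
    print(mechanism, round(full_mean, 1), round(observed_mean, 1))
```

In this setup, the observed mean stays close to the full mean under MCAR, whereas it is shifted downward under both MAR and MNAR. The difference between the latter two is that under MAR the shift can be explained by the observed body mass index, while under MNAR it cannot be recovered from the observed data.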

Exploring missing data
It should be realized that, by definition, it is not possible to evaluate whether the missing data are MAR or MNAR. The difference between the two is that when missing data are MNAR, the missingness is related to unobserved data; because these data are unobserved and therefore unknown, it is impossible to evaluate whether they are related to the missingness. There are, however, several possibilities to explore whether the data are MCAR or not [5,6]. t-tests and logistic regression analyses can be used to investigate whether there is a relationship between variables with and without missing data. The variable with missing data can be coded 0 for the observed and 1 for the missing data. When this variable (i.e., the missing data indicator variable) is used as a grouping variable in a t-test or as an outcome in a logistic regression analysis, the relationship with other variables can be explored. Another method that can be used is Little's MCAR test.
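The missing data indicator approach can be sketched as follows. This is a minimal illustration with made-up toy values; the helper names `welch_t` and `missingness_t_statistic` are hypothetical, and the Welch (unequal variance) form of the t statistic is used here as one possible choice. Missing blood pressure values are represented by `None`.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch t statistic for the difference in means between two groups."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se

def missingness_t_statistic(incomplete, covariate):
    """Compare a covariate between cases with missing and observed values."""
    # Missing data indicator: 1 = missing, 0 = observed
    indicator = [1 if value is None else 0 for value in incomplete]
    cov_missing = [c for c, r in zip(covariate, indicator) if r == 1]
    cov_observed = [c for c, r in zip(covariate, indicator) if r == 0]
    return welch_t(cov_missing, cov_observed)

# Toy data (invented for illustration): blood pressure with missing values, BMI complete
blood_pressure = [118, None, 131, None, 125, 122, None, 140, 119, None]
bmi = [22.0, 31.5, 24.1, 29.8, 23.3, 25.0, 33.2, 26.4, 21.7, 30.9]

t_stat = missingness_t_statistic(blood_pressure, bmi)
print(round(t_stat, 2))
```

A t statistic far from zero suggests that missingness is related to the covariate, which argues against MCAR. It does not help to separate MAR from MNAR, for the reason given above.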

Methods to deal with missing data
Different methods are available to deal with missing data [7]. A method that is still commonly used is complete-case analysis (CCA), in which all persons with missing values on one or more variables are excluded from the analysis. CCA has many drawbacks and should generally be avoided [8]. Only in some situations, including some MAR situations, may CCA generate unbiased results. For instance, when only outcome data are missing and the analysis is adjusted for variables related to the missing outcome, CCA leads to unbiased results [9]. Furthermore, in longitudinal data analyses, when outcome data are missing in some of the repeated measures, an analysis on the available data will also provide valid results [10].
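The statement that CCA can be unbiased when only outcome data are missing and the analysis adjusts for the variable related to the missingness can be illustrated with a simulation. The sketch below is hypothetical: the data-generating model, the true slope of 1.5, and the helper name `ols_slope` are assumptions chosen for this example. Blood pressure (the outcome) is made missing with a probability that depends only on body mass index.

```python
import math
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(7)
TRUE_SLOPE = 1.5  # assumed effect of BMI on blood pressure in this simulation

bmi, bp = [], []
for _ in range(20000):
    b = rng.gauss(27, 4)
    bmi.append(b)
    bp.append(120 + TRUE_SLOPE * (b - 27) + rng.gauss(0, 10))

# MAR: the outcome is more often missing for people with a high BMI
complete = [(b, p) for b, p in zip(bmi, bp)
            if rng.random() >= 1 / (1 + math.exp(-(b - 27) / 2))]
cc_bmi = [b for b, _ in complete]
cc_bp = [p for _, p in complete]

cc_slope = ols_slope(cc_bmi, cc_bp)   # complete-case regression, adjusted for BMI
cc_mean = statistics.fmean(cc_bp)     # complete-case mean, not adjusted
full_mean = statistics.fmean(bp)
print(round(cc_slope, 2), round(cc_mean, 1), round(full_mean, 1))
```

In this setup, the complete-case regression slope stays close to the assumed true value, whereas the unadjusted complete-case mean of blood pressure is too low. This is why the validity of CCA depends on the analysis model and not only on the missing data mechanism.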
One of the most commonly used methods to deal with missing data is imputation (replacement of missing data by real values). Single imputation methods such as mean imputation, imputation based on linear regression, or, for longitudinal data, last value/observation carried forward are not recommended because most of these methods lead to an artificially decreased standard deviation in the variables to be analysed and, therefore, result in standard errors that are too small [7]. The recommended method is multiple imputation (MI) [11,12]. MI consists of three phases: imputation, analysis, and pooling. In the imputation phase, each missing value is replaced by several different values, which leads to multiple imputed datasets. The values used for imputation are derived from an imputation regression model. In this imputation regression model, variables that are related to the missing data and/or are correlated with the incomplete data variables (known as auxiliary variables) are used to 'predict' the missing value [13]. Additional noise is added to the predicted (imputed) values, which guarantees spread in the imputed values. One piece of advice that is sometimes overlooked is that the outcome variable has to be part of the imputation model [14]. Although several methods are available for generating the imputed values [15], the Multivariate Imputation by Chained Equations (MICE) procedure is the most commonly used and is implemented in standard software programs [12]. Within MI, predictive mean matching is the preferred method [16]. Predictive mean matching uses observed values to impute missing values on the basis of closest matches (nearest neighbors). This prevents the imputation of unrealistic values [16]. In the MI analysis phase, the different datasets are analyzed with the appropriate statistical method, and in the pooling phase, the results are summarized into one final estimate as per Rubin's rules. The uncertainty about the missing data is reflected in the standard error of the pooled effect estimate [11].
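The three phases of MI can be sketched in a few lines. The example below is a simplified illustration under stated assumptions and is not the MICE implementation. It imputes one incomplete variable from one complete predictor, uses a bootstrap of the observed cases as a stand-in for drawing the parameters of the imputation model, and uses the same fitted line for donors and recipients. The names `fit_line`, `pmm_impute`, and `pool_rubin`, the number of donors, and the simulated data are all hypothetical. Pooling follows Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(x, y, rng, donors=3):
    """Return a completed copy of y; None entries are filled by a simplified PMM."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # Assumption: a bootstrap sample stands in for the parameter draw of the imputation model
    boot = [rng.choice(observed) for _ in observed]
    intercept, slope = fit_line([b[0] for b in boot], [b[1] for b in boot])
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        # Donors: observed cases whose predicted value is closest to the target
        nearest = sorted(observed,
                         key=lambda o: abs(intercept + slope * o[0] - target))[:donors]
        completed.append(rng.choice(nearest)[1])  # impute a real, observed value
    return completed

def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance between imputed datasets
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Simulated example: blood pressure (y) is missing more often when BMI (x) is high
rng = random.Random(42)
x = [rng.gauss(27, 4) for _ in range(200)]
y_full = [120 + 1.5 * (xi - 27) + rng.gauss(0, 10) for xi in x]
y = [None if xi > 29 and rng.random() < 0.7 else yi for xi, yi in zip(x, y_full)]

estimates, variances = [], []
for _ in range(20):                                   # imputation phase, m = 20
    completed = pmm_impute(x, y, rng)
    estimates.append(statistics.fmean(completed))     # analysis phase: mean of y
    variances.append(statistics.variance(completed) / len(completed))
pooled_mean, pooled_se = pool_rubin(estimates, variances)  # pooling phase
print(round(pooled_mean, 1), round(pooled_se, 2))
```

Because the between-imputation variance enters the total variance, the pooled standard error is larger than the standard error from any single imputed dataset would suggest. In practice, an established implementation such as the mice package in R is the appropriate tool; this sketch only shows the logic.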
As the imputation model is very important in MI, guidelines on how to specify it are available [12,16]. Furthermore, postestimation pooling procedures for regression models and for procedures such as chi-squared and likelihood ratio tests [17] are increasingly being developed for R software and can be found in packages such as mice [18], miceafter [19], miceadds [20], and psfmi [21]. Table 1 summarizes whether imputation is necessary and which imputation method should be used. First of all, it should be realized that when data are MCAR, complete case analysis is a less precise but still valid way to analyse the data. It is sometimes argued that imputation should also be used in MCAR situations to increase the power of the analysis. That is a weak argument and should not, in general, be used to justify missing data imputation. As for all statistical methods, there are some guidelines about the percentage of missing data above which imputation is necessary. Mostly, a missing data percentage of 5% is mentioned as a sort of cutoff. However, it should be realized that not only the percentage of missing data is important but also the strength of the relationship between missing and observed variables. Furthermore, it is suggested that MI can be used (or has to be used) even in situations with more than 50% missing data. However, when 50% or more of a particular variable is missing, it is highly questionable whether the available data of that particular variable are valid. In such situations, it may be better to leave that particular variable out of the analyses. That does not have to be a big problem because in all studies some important variables are not measured at all.

Final remarks
Research on MI is ongoing and currently focuses, among others, on the development of imputation models for  [22,23], questionnaire data [24,25], cost-effectiveness data [26], and the development and validation of prognostic models [27,28]. As missing data can seriously influence study outcomes, they have to be addressed well. Guidelines on how to conduct a suitable missing value analysis and to choose a proper method to handle the missing data are currently within reach of every researcher [29–31]. There is therefore no longer any excuse to ignore missing data.

Key issues
Regarding missing data, prevention is always better than treatment.