Missing data should be handled differently for prediction than for description or causal explanation


      Missing data are much studied in epidemiology and statistics. Theoretical development and application of methods for handling missing data have mostly been conducted in the context of prospective research data and with a goal of description or causal explanation. However, it is now common to build predictive models using routinely collected data, where missingness patterns may convey important information, and one might take a pragmatic approach to optimizing prediction. Therefore, different methods to handle missing data may be preferred. Furthermore, an underappreciated issue in prediction modeling is that the missing data method used in model development may not match the method used when a model is deployed. This may lead to overoptimistic assessments of model performance. For prediction, particularly with routinely collected data, methods for handling missing data that incorporate information within the missingness pattern should be explored and further developed. Where missing data methods differ between model development and model deployment, the implications of this difference must be explicitly evaluated. The trade-off between building a prediction model that is causally principled and building one that maximizes the use of all available information should be carefully considered and will depend on the intended use of the model.
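
      As a rough illustration of two points above (using the missingness pattern as information, and applying the same missing data method at development and at deployment), the following is a minimal sketch rather than a method taken from the article. It assumes a simple missing-indicator encoding with mean fill values; the function names `fit_fill_values` and `encode`, and the toy predictors `cholesterol` and `sbp`, are hypothetical.

```python
from statistics import mean


def fit_fill_values(rows, predictors):
    """Learn one fill value (the observed mean) per predictor from development data."""
    return {
        p: mean(r[p] for r in rows if r[p] is not None)
        for p in predictors
    }


def encode(row, fill_values):
    """Fill missing values and add a 0/1 missingness flag for each predictor."""
    out = {}
    for p, fill in fill_values.items():
        missing = row.get(p) is None
        out[p] = fill if missing else row[p]
        out[p + "_missing"] = int(missing)
    return out


# Toy development data (hypothetical values); None marks a missing measurement.
development = [
    {"cholesterol": 5.0, "sbp": 120.0},
    {"cholesterol": None, "sbp": 140.0},
    {"cholesterol": 6.0, "sbp": None},
]

# Fill values are estimated once, on development data only.
fill_values = fit_fill_values(development, ["cholesterol", "sbp"])
dev_encoded = [encode(r, fill_values) for r in development]

# At deployment, the same encoding and the same stored fill values are reused,
# so the handling of missing data matches what the model saw in development.
new_patient = encode({"cholesterol": None, "sbp": 130.0}, fill_values)
```

      The flags let a downstream model learn from whether a value was recorded, and storing `fill_values` keeps the deployment-time handling identical to the development-time handling. Whether this encoding is appropriate depends on the intended use of the model, as the abstract notes.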



