Original Article | Volume 139, P12-19, November 2021


Variable selection methods were poorly reported but rarely misused in major medical journals: Literature review


      • In the “big five”.
      • Reporting about variable selection methods is insufficient.
      • Data-driven methods are not commonly used in causal explanatory models.
      • The addition of an adjustment variable is common in sensitivity analyses.


      Objective: This work reviews the reporting, practice and misuse of knowledge-based and data-driven variable selection methods in five highly cited medical journals. Unlike previous reviews, it also considers recoding and interactions.
      Study Design and Setting: The search covered original observational studies with a predictive or explanatory research question and multivariable analyses, published in N. Engl. J. Med., Lancet, JAMA, Br. Med. J. and Ann. Intern. Med. between 2017 and 2019. Article screening was performed by a single reader. Data extraction was performed by two readers, with a third reader consulted in case of disagreement. The use of data-driven variable selection methods for causal explanatory questions was considered misuse.
      Results: A total of 488 articles were included. The variable selection method was unclear in 234 (48%) articles, data-driven in 78 (16%) articles and knowledge-based in 176 (36%) articles. The most common data-driven methods were univariate selection (n = 22, 4.5%) and model comparisons or testing for interaction (n = 17, 3.5%). Data-driven methods were misused in 51 (10.5%) articles.
      Conclusion: Overall, reporting of variable selection methods is insufficient. Data-driven methods appear to be used in only a minority of articles in the big five medical journals.
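
      The percentages in the Results can be recomputed from the reported counts. The short Python sketch below is only an illustrative check, not part of the authors' analysis; the variable names and the `pct` helper are assumptions introduced here, and the counts are copied from the abstract.

```python
# Counts copied from the Results paragraph of the abstract.
TOTAL = 488

# Classification of the variable selection method (mutually exclusive groups).
method_counts = {
    "unclear": 234,
    "data-driven": 78,
    "knowledge-based": 176,
}

# Other counts reported against the same denominator.
other_counts = {
    "univariate selection": 22,
    "model comparisons or testing for interaction": 17,
    "data-driven methods misused": 51,
}


def pct(n, total=TOTAL):
    """Return n as a percentage of total (hypothetical helper)."""
    return 100 * n / total


# The three method groups should account for every included article.
assert sum(method_counts.values()) == TOTAL

for label, n in {**method_counts, **other_counts}.items():
    print(f"{label}: {n}/{TOTAL} = {pct(n):.1f}%")
```

      The three method categories sum to the 488 included articles, which is consistent with every article being assigned to exactly one category.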



