Advertisement
Original Article|Articles in Press

The DetectDeviatingCells algorithm was a useful addition to the toolkit for cellwise error detection in observational data.

  • Laura Viviani
    Correspondence
    Corresponding author: Laura Viviani, London School of Hygiene and Tropical Medicine, Faculty of Public Health and Policy, Department of Health Services Research and Policy, 15-17 Tavistock Place, London WC1H 9SH, United Kingdom,
    Affiliations
    London School of Hygiene and Tropical Medicine, Faculty of Public Health and Policy, Department of Health Services Research and Policy. 15-17 Tavistock Place, London, WC1H 9SH, United Kingdom
    Search for articles by this author
  • Ian R. White
    Affiliations
    Medical Research Council Clinical Trials Unit at University College London. 90 High Holborn, London, WC1V 6LJ, United Kingdom
    Search for articles by this author
  • Elizabeth J. Williamson
    Affiliations
    London School of Hygiene and Tropical Medicine, Faculty of Epidemiology and Population Health, Department of Medical Statistics. Keppel Street, London, WC1E 7HT, United Kingdom
    Search for articles by this author
  • James Carpenter
    Affiliations
    Medical Research Council Clinical Trials Unit at University College London. 90 High Holborn, London, WC1V 6LJ, United Kingdom

    London School of Hygiene and Tropical Medicine, Faculty of Epidemiology and Population Health, Department of Medical Statistics. Keppel Street, London, WC1E 7HT, United Kingdom
    Search for articles by this author
  • Jan van der Meulen
    Affiliations
    London School of Hygiene and Tropical Medicine, Faculty of Public Health and Policy, Department of Health Services Research and Policy. 15-17 Tavistock Place, London, WC1H 9SH, United Kingdom
    Search for articles by this author
  • David A. Cromwell
    Affiliations
    London School of Hygiene and Tropical Medicine, Faculty of Public Health and Policy, Department of Health Services Research and Policy. 15-17 Tavistock Place, London, WC1H 9SH, United Kingdom
    Search for articles by this author
Open AccessPublished:February 16, 2023DOI:https://doi.org/10.1016/j.jclinepi.2023.02.015
      This paper is only available as a PDF. To read, Please Download here.

      Highlights

      • Data quality controls can be time-consuming to carry out.
      • Analysts often need to trade-off false positives and false negatives of error detection methods.
      • Multivariable, robust, error detection methods had comparable performance for gross errors.
      • The DDC algorithm outperformed the other methods for more complex error patterns.
      • The DDC algorithm has the potential to improve error detection processes for observational data.

      Abstract

      Objective

      We evaluated the error detection performance of the DetectDeviatingCells (DDC) algorithm, which flags data anomalies at observation (casewise) and variable (cellwise) level in continuous variables. We compared its performance to other approaches in a simulated dataset.

      Study design and setting

      We simulated height and weight data for hypothetical individuals aged 2-20 years. We changed a proportion of height values according to pre-determined error patterns. We applied the DDC algorithm and other error-detection approaches (descriptive statistics, plots, fixed-threshold rules, classic and robust Mahalanobis distance) and we compared error detection performance with sensitivity, specificity, likelihood ratios, predictive values and ROC curves.

      Results

      At our chosen thresholds, error detection specificity was excellent across all scenarios for all methods and sensitivity was higher for multivariable and robust methods. The DDC algorithm performance was similar to other robust multivariable methods. Analysis of ROC curves suggested that all methods had comparable performance for gross errors (e.g. wrong measurement unit), but the DDC algorithm outperformed the others for more complex error patterns (e.g. transcription errors that are still plausible, although extreme).

      Conclusions

      The DDC algorithm has the potential to improve error detection processes for observational data.

      Keyword