Abstract
Objectives
Study design and setting
Results
Conclusion
Keywords
Key findings
- The approach to rating certainty of evidence for comparative test accuracy has similarities with, but also important differences from, the evaluation of accuracy evidence for individual tests. The most notable differences lie in the selection of study designs, the assessment of risk of bias, and the use of comparative measures of test accuracy.

What this adds to what is known?
- Detailed guidance for rating certainty in comparative accuracy evidence did not yet exist. Such ratings require additional considerations, such as risk of bias criteria for appraising comparative test accuracy studies and rating the certainty of evidence separately for comparative test accuracy studies and for between-study (indirect) comparisons.

What is the implication, what should change now?
- Investigators who synthesize and assess the certainty of a body of evidence from comparative test accuracy studies of healthcare-related tests and diagnostic strategies are recommended to use the present GRADE guidance.
1. Introduction
2. Comparative test accuracy questions

3. Certainty of evidence for comparative test accuracy
3.1 Study design
4. Reasons for down- or upgrading the certainty of evidence
Domain of certainty of evidence | Explanation | Differences from GRADE guidance for single test accuracy [[5],[6]]
---|---|---
Limitations in study design and execution ('risk of bias') | Ideally, studies that directly compare the index tests in the same study group and setting (comparative test accuracy studies) should be considered. If comparative test accuracy studies are lacking, between-study comparisons of studies evaluating a single test may be considered as well; however, these will generally be rated down for indirectness. All study designs start at high certainty but may be rated down for the following reasons: … | Use of tools that allow assessment of risk of bias in comparative test accuracy questions, where two or more competing index tests are compared
Indirectness and applicability | Indirectness can lower the certainty of evidence when: … | Same criteria as for single test accuracy studies, using comparative measures
Inconsistency | Unexplained inconsistency in comparative test accuracy (expressed as an absolute difference, ratio, or odds ratio) can lower the certainty of evidence. For between-study comparisons, this judgement is more challenging, as it requires (1) judging inconsistency for each index test separately and (2) inferring the inconsistency of their comparison. | Same criteria as for single test accuracy studies, using comparative measures
Imprecision | Certainty of evidence is lower when the confidence interval for comparative test accuracy crosses a prespecified threshold or range. | Same criteria as for single test accuracy studies, using comparative measures
Publication bias | High suspicion of publication bias can lower the certainty of evidence: … | Same criteria as for single test accuracy studies, using comparative measures
Upgrading for dose effect, large effects, and residual plausible bias and confounding | Temporary and limited guidance, which requires additional research: … | Similar considerations as for single test accuracy
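The imprecision domain asks whether the confidence interval for a comparative measure crosses a prespecified threshold. As a minimal sketch, the following computes a Wald confidence interval for a difference in sensitivities; the counts are invented, and for simplicity it treats the two sensitivities as independent, which ignores the pairing present in fully paired studies:

```python
from math import sqrt

def sens_diff_ci(tp_a, fn_a, tp_b, fn_b, z=1.96):
    """Wald 95% CI for the difference in sensitivities of two tests.

    Simplified sketch: treats the two sensitivities as independent,
    which ignores within-study pairing in comparative designs.
    """
    n_a, n_b = tp_a + fn_a, tp_b + fn_b
    p_a, p_b = tp_a / n_a, tp_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: 100 diseased participants per test.
diff, ci_lo, ci_hi = sens_diff_ci(tp_a=90, fn_a=10, tp_b=70, fn_b=30)
# If a prespecified threshold (say, a difference of 0.10) lies inside
# the interval, the imprecision domain may warrant rating down.
```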
4.1 Risk of bias (limitations in study design and execution)
4.1.1 Example for risk of bias

Certainty assessment and summary of findings, HPV versus VIA (number of results per 1000 women, 95% CI):

Outcome | No. of studies | Study design | Risk of bias | Indirectness | Inconsistency | Imprecision | Publication bias | HPV per 1000 (95% CI) | VIA per 1000 (95% CI) | Difference per 1000 (95% CI) | Certainty of the body of evidence
---|---|---|---|---|---|---|---|---|---|---|---
True positives | 5 (406 women) | Fully paired studies | Not serious | Not serious | Serious | Serious | Undetected | 19 (17 to 20) | 14 (10 to 17) | 5 more (1 to 9 more) | ⊕⊕◯◯ Low
False negatives | 5 (406 women) | Fully paired studies | Not serious | Not serious | Serious | Serious | Undetected | 1 (0 to 3) | 6 (3 to 10) | 5 fewer (9 fewer to 1 fewer) | ⊕⊕◯◯ Low
True negatives | 5 (9113 women) | Fully paired studies | Not serious | Not serious | Serious | Not serious | Undetected | 819 (716 to 887) | 852 (765 to 908) | 34 fewer (143 fewer to 76 more) | ⊕⊕⊕◯ Moderate
False positives | 5 (9113 women) | Fully paired studies | Not serious | Not serious | Serious | Not serious | Undetected | 161 (93 to 264) | 128 (72 to 215) | 34 more (76 fewer to 143 more) | ⊕⊕⊕◯ Moderate
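The per-1000 columns in such an evidence profile are derived from summary sensitivity and specificity at an assumed prevalence. The sketch below back-calculates illustrative values from the point estimates in the table above (roughly 2% prevalence, i.e. 20 of 1000 women with CIN 2-3, sensitivity about 0.95 for HPV and 0.70 for VIA); these inputs are reconstructions for illustration, not figures reported by the review:

```python
# Derive "per 1000 women" point estimates from summary accuracy and an
# assumed prevalence, as done in GRADE evidence profiles.
def per_1000(prevalence, sensitivity, specificity, n=1000):
    diseased = prevalence * n
    healthy = n - diseased
    return {
        "TP": round(diseased * sensitivity),        # correctly detected
        "FN": round(diseased * (1 - sensitivity)),  # missed cases
        "TN": round(healthy * specificity),         # correctly ruled out
        "FP": round(healthy * (1 - specificity)),   # false alarms
    }

# Accuracy values back-calculated from the profile above (assumptions).
hpv = per_1000(prevalence=0.02, sensitivity=0.95, specificity=0.836)
via = per_1000(prevalence=0.02, sensitivity=0.70, specificity=0.869)
print(hpv)
print(via)
```

Differences between tests (e.g. 5 more true positives per 1000 with HPV) then follow by subtracting the two columns.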
4.2 Indirectness
In women at risk for cervical intraepithelial neoplasia (CIN), how does the accuracy of the human papillomavirus (HPV) test compare with that of visual inspection with acetic acid (VIA) for the diagnosis of CIN grade 2-3 lesions, verified by colposcopy with or without biopsy? [7]

Question element | Description
---|---
Population | Non-pregnant women aged 18 or older, not previously diagnosed with or treated for CIN, in low- and middle-income countries in a screening setting
Intervention test | HPV test (its role is to replace VIA)
Comparison test | VIA
Outcome | Test accuracy for CIN 2-3 lesions, detected by the reference standard colposcopy with or without biopsy
4.2.1 Example for indirectness
4.3 Inconsistency
Box 2. Comparative measures of test accuracy
- A variety of measures can be used to express and compare test accuracy. When studies report a specific cut-off value for test positivity, accuracy can be expressed in terms of sensitivity and specificity. Related comparative test accuracy measures then include: the absolute difference in sensitivity and specificity, the relative sensitivity and specificity, the odds ratio of a (true) positive result in the diseased (based on the sensitivities), and the odds ratio of a (true) negative result in the non-diseased (based on the specificities). Such measures are only valid for the specific cut-off values of each test; they cannot be generalized to other cut-off values.
- If studies report multiple cut-off values for a test, it may be more challenging to compare and select a test. One method is based on the area under the receiver operating characteristic curve (AUC). If the receiver operating characteristic curves of the tests do not cross, the test with the higher AUC is to be preferred. A disadvantage of using a global measure of test accuracy, such as the AUC, is that the absolute effects of using a test on downstream health outcomes cannot easily be quantified. After first selecting the preferred test, a specific cut-off value can be chosen or result-specific likelihood ratios [[19]] can be derived for using the test in practice. In meta-analysis, test accuracy at specified cut-off values can be identified using recently developed models [[20],[21],[22]].
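The fixed-cut-off measures listed in Box 2 can be sketched as follows; the 2x2 counts are invented for illustration and do not come from any study cited in this guidance:

```python
# Comparative accuracy measures at fixed cut-offs, as listed in Box 2.
def accuracy(tp, fn, tn, fp):
    """Sensitivity and specificity from one test's 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)

def comparative_measures(sens_a, spec_a, sens_b, spec_b):
    """Compare test A against test B at their specific cut-offs."""
    return {
        "sens_difference": sens_a - sens_b,   # absolute difference
        "spec_difference": spec_a - spec_b,
        "relative_sens": sens_a / sens_b,     # relative sensitivity
        "relative_spec": spec_a / spec_b,
        # Odds ratio of a true positive result in the diseased:
        "sens_odds_ratio": (sens_a / (1 - sens_a)) / (sens_b / (1 - sens_b)),
        # Odds ratio of a true negative result in the non-diseased:
        "spec_odds_ratio": (spec_a / (1 - spec_a)) / (spec_b / (1 - spec_b)),
    }

# Hypothetical fully paired study: 100 diseased, 400 non-diseased women.
sens_a, spec_a = accuracy(tp=90, fn=10, tn=320, fp=80)  # test A
sens_b, spec_b = accuracy(tp=70, fn=30, tn=340, fp=60)  # test B
measures = comparative_measures(sens_a, spec_a, sens_b, spec_b)
print(measures)
```

As Box 2 notes, these values hold only for the cut-offs at which the 2x2 tables were constructed.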
4.3.1 Example for inconsistency
4.4 Imprecision
4.4.1 Example for imprecision
4.5 Publication bias
4.5.1 Example for publication bias
4.6 Reasons for upgrading the certainty of evidence
5. Considerations for between-study (indirect) comparisons
5.1 Risk of bias and indirectness in between-study comparisons
5.2 Inconsistency in between-study comparisons

6. Bodies of evidence consisting of both comparative test accuracy studies and between-study comparisons
7. Conclusion
Acknowledgements
Appendix. Supplementary materials
References
- GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008; 336: 1106-1110. https://doi.org/10.1136/bmj.a139
- GRADE guidelines: 16. GRADE evidence to decision frameworks for tests in clinical practice and public health. J Clin Epidemiol. 2016; 76: 89-98. https://doi.org/10.1016/j.jclinepi.2016.01.032
- Defining ranges for certainty ratings of diagnostic accuracy: a GRADE concept paper. J Clin Epidemiol. 2020; 117: 138-148. https://doi.org/10.1016/j.jclinepi.2019.05.002
- GRADE guidelines: 22. The GRADE approach for tests and strategies - from test accuracy to patient important outcomes and recommendations. J Clin Epidemiol. 2019; 111: 69-82. https://doi.org/10.1016/j.jclinepi.2019.02.003
- GRADE guidelines: 21 part 1. Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy. J Clin Epidemiol. 2020; 122: 129-141. https://doi.org/10.1016/j.jclinepi.2019.12.020
- GRADE guidelines: 21 part 2. Inconsistency, imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables. J Clin Epidemiol. 2020; 122: 142-152. https://doi.org/10.1016/j.jclinepi.2019.12.021
- Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy. Int J Gynecol Obstet. 2016; 132: 259-265. https://doi.org/10.1016/j.ijgo.2015.07.024
- WHO Meeting Report of a Technical Expert Consultation: Non-Inferiority Analysis of Xpert MTB/RIF Ultra Compared to Xpert MTB/RIF. Geneva; 2017. https://www.who.int/tb/publications/2017/XpertUltra/en/
- Empirical evidence of the importance of comparative studies of diagnostic test accuracy. Ann Intern Med. 2013; 158: 544. https://doi.org/10.7326/0003-4819-158-7-201304020-00006
- Comparative reviews of diagnostic test accuracy in imaging research: evaluation of current practices. Eur Radiol. 2019; 29: 5386-5394. https://doi.org/10.1007/s00330-019-06045-7
- Interpreting GRADE's levels of certainty or quality of the evidence: GRADE for statisticians, considering review information size or less emphasis on imprecision? J Clin Epidemiol. 2016; 75: 6-15. https://doi.org/10.1016/j.jclinepi.2016.03.018
- Certainty ranges facilitated explicit and transparent judgments regarding evidence credibility. J Clin Epidemiol. 2018; 104: 46-51. https://doi.org/10.1016/j.jclinepi.2018.08.014
- Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ. 2006; 332: 1089-1092. https://doi.org/10.1136/bmj.332.7549.1089
- QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011; 155: 529. https://doi.org/10.7326/0003-4819-155-8-201110180-00009
- Yang B., Mallett S., Takwoingi Y. Development of QUADAS-C, a risk of bias tool for comparative diagnostic accuracy studies. https://doi.org/10.17605/OSF.IO/HQ8MF
- Grading quality of evidence and strength of recommendations in clinical practice guidelines: part 2 of 3. The GRADE approach to grading quality of evidence about diagnostic tests and strategies. Allergy Eur J Allergy Clin Immunol. 2009; 64: 1109-1116. https://doi.org/10.1111/j.1398-9995.2009.02083.x
- The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, 2003
- Chapter 18: Diagnostic tests. In: Guyatt G., Rennie D., Meade M.O., Cook D.J. (eds). Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. 3rd ed. American Medical Association, 2014: 345-357
- Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies. BMC Med Res Methodol. 2016; 16: 1-15. https://doi.org/10.1186/s12874-016-0196-1
- Meta-analysis of full ROC curves using bivariate time-to-event models for interval-censored data. Res Synth Methods. 2018; 9: 62-72. https://doi.org/10.1002/jrsm.1273
- Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis. Stat Med. 2019; 38: 4789-4803. https://doi.org/10.1002/sim.8301
- Decision making about healthcare-related tests and diagnostic test strategies. Paper 5: a qualitative study with experts suggests that test accuracy data alone is rarely sufficient for decision making. J Clin Epidemiol. 2017; 92: 47-57. https://doi.org/10.1016/j.jclinepi.2017.09.005
- GRADE guidelines: 6. Rating the quality of evidence - imprecision. J Clin Epidemiol. 2011; 64: 1283-1293. https://doi.org/10.1016/j.zefq.2012.10.016
- Methods and reporting of systematic reviews of comparative accuracy were deficient: a methodological survey and proposed guidance. J Clin Epidemiol. 2020; 121: 1-14. https://doi.org/10.1016/j.jclinepi.2019.12.007
- Risk of bias assessment of test comparisons was uncommon in comparative accuracy systematic reviews: an overview of reviews. J Clin Epidemiol. 2020; 127: 167-174. https://doi.org/10.1016/j.jclinepi.2020.08.007
- A scoping review and survey provides the rationale, perceptions, and preferences for the integration of randomized and nonrandomized studies in evidence syntheses and GRADE assessments. J Clin Epidemiol. 2018; 98: 33-40. https://doi.org/10.1016/j.jclinepi.2018.01.010
- Study designs for comparative diagnostic test accuracy: a methodological review and classification scheme. J Clin Epidemiol. 2021. https://doi.org/10.1016/j.jclinepi.2021.04.013
Footnotes
Conflicts of interest: The authors are members of the GRADE working group. Bada Yang, Mariska Leeflang, Miranda Langendam and Patrick Bossuyt were involved in the development of the QUADAS-C tool.
Funding: Amsterdam UMC, the AMC foundation (The Netherlands), the Michael G. De Groote Cochrane Canada and McMaster GRADE centres provided funding for this project. The funding organizations had no role in the design, collection, analysis, and interpretation of the data or the decision to approve publication of the finished manuscript.