Original article| Volume 136, P146-156, August 2021

# GRADE Guidance: 31. Assessing the certainty across a body of evidence for comparative test accuracy

Open Access | Published: April 14, 2021

## Abstract

### Objectives

This article provides GRADE guidance on how authors of evidence syntheses and health decision makers, including guideline developers, can rate the certainty across a body of evidence for comparative test accuracy questions.

### Study design and setting

This guidance extends the previously published GRADE guidance for assessing certainty of evidence for test accuracy to scenarios in which two or more index tests are compared. Through an iterative brainstorm-discussion-feedback process within the GRADE working group, we developed guidance accompanied by practical examples.

### Results

Rating the certainty of evidence for comparative test accuracy shares many concepts and ideas with the existing GRADE guidance for test accuracy. The rating in comparisons of test accuracy requires additional considerations, such as the selection of appropriate comparative study designs, additional criteria for judging risk of bias, and the consequences of using comparative measures of test accuracy. Distinct approaches to rating certainty are required for comparative test accuracy studies and between-study (indirect) comparisons.

### Conclusion

This GRADE guidance will support transparent assessment of the certainty for a body of comparative test accuracy evidence.

## What is new?

### Key findings

• The approach to rating certainty of evidence for comparative test accuracy has similarities with, but also important differences from, the evaluation of accuracy evidence for individual tests. Most notable differences are in the selection of study designs, assessment of risk of bias, and the use of comparative measures of test accuracy.

### What this adds to what is known?

• Detailed guidance for rating certainty in comparative accuracy evidence did not yet exist. Such ratings require additional considerations, such as risk of bias criteria for appraising comparative test accuracy studies and rating the certainty of evidence separately for comparative test accuracy studies and between-study (indirect) comparisons.

### What is the implication, what should change now?

• Investigators who synthesize and assess the certainty of a body of evidence from comparative test accuracy studies for healthcare related tests and diagnostic strategies are recommended to use the present GRADE guidance.

## 1. Introduction

Recommendations regarding healthcare related tests and diagnostic strategies are ideally based on studies assessing the effect of alternative testing strategies on people-important outcomes [Schünemann et al., GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies]. However, if such studies are unavailable, test accuracy can be used as a surrogate to assess the likely impact of tests and strategies on people-important outcomes [Schünemann et al., GRADE Guidelines: 16. GRADE evidence to decision frameworks for tests in clinical practice and public health].
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group has previously described approaches on how to rate the certainty of evidence and develop recommendations for healthcare related tests and diagnostic strategies [Schünemann et al., GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies; Schünemann et al., GRADE Guidelines: 16; Hultcrantz et al., Defining ranges for certainty ratings of diagnostic accuracy: a GRADE concept paper; Schünemann et al., GRADE Guidelines: 22]. Recently, articles 21 part 1 and 2 in this series provided updated guidance on how to assess the certainty (also known as quality or confidence) of a body of evidence from test accuracy studies [Schünemann et al., GRADE guidelines: 21 part 1; Schünemann et al., GRADE guidelines: 21 part 2]. This guidance addressed the importance of establishing the purpose of a test, how to frame clear healthcare questions, and reasons for downgrading or upgrading certainty of the evidence. It did not provide detailed guidance for rating certainty in comparisons of the accuracy of two or more tests.
Since the development of recommendations requires the consideration of alternatives, studies of test accuracy are often more useful if they compare the performance of two or more tests. This article provides GRADE guidance for these scenarios, which we refer to as comparative test accuracy questions (see Box 1 for an example). We briefly introduce the concept of comparative test accuracy and the major comparative study designs, and then focus on how each domain of certainty of evidence can be rated for comparative test accuracy questions.
This guidance was developed iteratively through brainstorming and discussion by the authors, with feedback from the wider GRADE diagnosis project group. A rapid literature review of systematic reviews of comparative test accuracy was carried out to ensure no existing methods for rating certainty of evidence were missed (see Appendix A for details). Three teleconferences were held among the diagnosis project group members to scrutinize initial ideas for the guidance and to pilot these ideas on three examples presented in this paper. This paper was presented to the entire GRADE Working Group at a virtual meeting in June 2020 and, after revisions, it was approved as GRADE guidance at a subsequent meeting in October 2020.

## 2. Comparative test accuracy questions

Many healthcare questions ask which of two promising tests – neither of which is the reference standard – has superior accuracy. For example, the World Health Organization recently compared two molecular tests for diagnosing drug-resistant tuberculosis, Xpert and Xpert Ultra, with sputum culture as the reference standard [World Health Organization]. These comparisons of two or more index tests aim to estimate the absolute or relative difference in accuracy between the index tests (Fig. 1).
Here, the phrase ‘index test’ should be interpreted broadly, as it can be a combination of tests, or even a complete testing strategy. The phrase ‘reference standard’ should be interpreted as the best available method to verify whether an individual has the target condition or not [Schünemann et al., GRADE guidelines: 21 part 1].
Estimation of comparative test accuracy ideally requires a comparative test accuracy study: an evaluation of competing index tests in a single study. However, in many bodies of evidence such studies may be rare or absent [Takwoingi et al., Empirical evidence of the importance of comparative studies of diagnostic test accuracy; Leeflang and Treanor, Comparative reviews of diagnostic test accuracy in imaging research]. In these cases, comparisons of test accuracy across studies, each evaluating a single index test (also referred to as between-study or indirect comparisons), may provide evidence of one test being more accurate than the other. However, these between-study comparisons will typically lead to low certainty of the evidence because differences in accuracy between index tests may be attributable to other factors, such as differences between the study groups (e.g., spectrum of diseased participants).

## 3. Certainty of evidence for comparative test accuracy

Before authors of evidence syntheses or guideline developers start to assess certainty, they need to make a choice about the target of their certainty of evidence rating by defining a threshold (not to be confused with thresholds for dichotomizing test results) or range [Hultcrantz et al., Defining ranges for certainty ratings of diagnostic accuracy: a GRADE concept paper]. This can be the null (i.e., no difference in test accuracy), a specified magnitude of difference, or a fully contextualized threshold based on relevant criteria in the GRADE Evidence to Decision (EtD) framework for tests [Schünemann et al., GRADE Guidelines: 22]. The two latter approaches require setting thresholds or ranges expressed in absolute terms for a given prevalence and back-calculating the thresholds for test accuracy [Hultcrantz et al., Defining ranges for certainty ratings of diagnostic accuracy], for example the difference in sensitivity and the difference in specificity. The threshold or range will affect the ratings of imprecision and inconsistency, as well as the other GRADE domains, through the concept of the certainty range (the range that characterizes the uncertainty of all five domains) [Schünemann, Interpreting GRADE's levels of certainty or quality of the evidence; Tikkinen et al., Certainty ranges facilitated explicit and transparent judgments regarding evidence credibility].
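To make the back-calculation concrete, here is a minimal sketch; the helper names and all numbers are illustrative, not thresholds from the guidance. An absolute threshold expressed per 1000 people tested is converted into a threshold on the difference in sensitivity or specificity at an assumed prevalence.

```python
# Sketch: back-calculate thresholds for test accuracy from absolute terms.
# All numbers are illustrative; real thresholds come from an EtD process.

def sensitivity_threshold(extra_tp_per_1000: float, prevalence: float) -> float:
    """Difference in sensitivity corresponding to a given number of
    additional true positives per 1000 people tested."""
    diseased_per_1000 = 1000 * prevalence
    return extra_tp_per_1000 / diseased_per_1000

def specificity_threshold(extra_fp_per_1000: float, prevalence: float) -> float:
    """Difference in specificity corresponding to a given number of
    additional false positives per 1000 people tested."""
    non_diseased_per_1000 = 1000 * (1 - prevalence)
    return extra_fp_per_1000 / non_diseased_per_1000

# At 2% prevalence there are 20 diseased and 980 non-diseased per 1000:
print(sensitivity_threshold(4, 0.02))   # 4 more TP per 1000 -> 0.2
print(specificity_threshold(49, 0.02))  # 49 more FP per 1000 -> 0.05
```

Note that the resulting thresholds only apply at the assumed prevalence; a different prevalence changes the absolute consequences of the same difference in accuracy.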

### 3.1 Study design

The initial GRADE certainty of the body of comparative test accuracy evidence starts at high, regardless of the study design. Below we describe how limitations in study design and execution impact on the certainty of a body of evidence. In valid comparative test accuracy studies, each participant undergoes all index tests and the reference standard (fully paired design). Alternatively, participants are randomly allocated to one of the index tests, followed by verification with the reference standard (randomized design) [Bossuyt et al., Comparative accuracy: assessing new tests against existing diagnostic pathways; Olsen et al., Study designs for comparative diagnostic test accuracy: a methodological review and classification scheme]. These designs allow for the valid estimation of the accuracy of each index test, as well as the absolute or relative difference in accuracy between the index tests. While these two designs are the best known, alternative designs may be used that do not require the verification of all participants but nevertheless provide valid estimates of comparative test accuracy (see Appendix B). Unpaired or partially paired designs without random allocation also start at high certainty, but they may be rated down for risk of bias.

## 4. Reasons for down- or upgrading the certainty of evidence

Table 1 provides an overview of the factors that influence certainty of evidence for comparative test accuracy. We note again that this applies to scenarios in which test accuracy studies provide the best available evidence and studies evaluating direct impact on people-important outcomes are not available [Schünemann et al., GRADE Guidelines: 22; Schünemann et al., GRADE guidelines: 21 part 1; Schünemann et al., GRADE guidelines: 21 part 2]. In the first part of this guidance, we discuss the approach to assessing certainty when the body of evidence consists of comparative test accuracy studies. Subsequently (section 5), we discuss how the approach differs for between-study comparisons.
Table 1. Overview of the factors that decrease/increase the certainty of evidence for comparative test accuracy. Differences from the GRADE guidance for single test accuracy [Schünemann et al., GRADE guidelines: 21 parts 1 and 2] are noted for each domain.

**Limitations in study design and execution (‘risk of bias’).** Ideally, studies directly comparing the index tests in the same study group and setting (comparative test accuracy studies) should be considered. If comparative test accuracy studies are lacking, between-study comparisons of studies evaluating a single test may be considered as well; however, they will generally be rated down for indirectness. All study designs start at high certainty but may be rated down for the following reasons:
• Presence of flaws in study design and conduct that may bias the accuracy of an individual test
• Presence of additional flaws that are specific to test comparisons, for example:
• Participants undergoing the index tests are unlikely to be comparable with regard to factors that affect test accuracy (for instance, disease severity); GRADE suggests rating down by one or two levels if unpaired, nonrandomized groups are compared within a study
• Index tests are possibly influenced by the knowledge or performance of their comparison tests
• The results of each index test are verified by a different reference standard
*Difference from single test accuracy guidance:* use tools that allow assessing risk of bias in comparative test accuracy questions, where two or more competing index tests are compared.

**Indirectness and applicability.** Indirectness can lower the certainty of evidence when:
• The population, intervention test (the index test), comparison test (the alternative index test), or outcomes (test accuracy for the target condition, detected by the reference standard) of the studies are substantially different from those of the healthcare question
• The index tests being compared are evaluated in different studies (between-study or indirect comparison); GRADE suggests rating down by one or two levels for indirect comparisons
*Difference from single test accuracy guidance:* same criteria as for single test accuracy studies, using comparative measures.

**Inconsistency.** Unexplained inconsistency in comparative test accuracy (expressed as an absolute difference, ratio, or odds ratio) can lower the certainty of evidence. For between-study comparisons, this judgement is more challenging as it requires (1) judging inconsistency for each index test separately and (2) inferring the inconsistency of their comparison.
*Difference from single test accuracy guidance:* same criteria as for single test accuracy studies, using comparative measures.

**Imprecision.** Certainty of evidence is lower when the confidence intervals for comparative test accuracy cross a prespecified threshold or range.
*Difference from single test accuracy guidance:* same criteria as for single test accuracy studies, using comparative measures.

**Publication bias.** High suspicion of publication bias can lower the certainty of evidence, based on:
• For-profit interest
• Knowledge of studies that exist but are not published
• Presence of only small studies with results suggesting an implausibly large difference in accuracy
• Funnel plot asymmetry and tests for small-study effects
*Difference from single test accuracy guidance:* same criteria as for single test accuracy studies, using comparative measures.

**Upgrading for dose effect, large effects, residual plausible bias and confounding.** Temporary and limited guidance, which requires additional research:
• A large difference in test accuracy between competing tests may alleviate concerns that the difference is due to bias
• Confounding or bias with a clearly predictable direction could increase certainty that one test has a higher/lower accuracy than its comparator
• How a dose-response gradient may play a role in increasing certainty is as yet unclear
*Difference from single test accuracy guidance:* similar considerations as for single test accuracy.

### 4.1 Risk of bias (limitations in study design and execution)

QUADAS-C, an extension of the QUADAS-2 tool [Whiting et al., QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies], can be used to evaluate risk of bias in comparative test accuracy studies [Yang B, Mallett S, Takwoingi Y. Development of QUADAS-C, a risk of bias tool for comparative diagnostic accuracy studies. doi:10.17605/OSF.IO/HQ8MF]. We are currently unaware of other risk of bias tools for comparative test accuracy studies. In addition to including appropriate design features of single test accuracy studies [Whiting et al., QUADAS-2], comparative test accuracy studies should ensure that participants are comparable in terms of factors that affect test accuracy. This is achieved if each participant undergoes all index tests (‘pairing’), or by appropriately randomizing participants to one of the index tests with concealment of the allocation. Index tests should be interpreted without knowledge of the results of the comparison tests, and index tests should not influence the performance of other index tests if used on the same participants. Furthermore, the results of each index test should be verified by the same reference standard. If one or more of these criteria are not fulfilled, the comparison may be at high risk of bias.
Of particular concern is when neither pairing nor randomization is used to select participants for each index test. This is the case when, for example, participants are assigned to receive index test A or B based on clinical indication. This is analogous to confounding by indication in nonrandomized studies of interventions and GRADE suggests rating down certainty by one or two levels depending on the raters’ judgement of the severity of the bias. Rating down by one level may be justified if – in the absence of other risk of bias issues – the raters’ judgement is that the most important confounding factors are controlled for in the design or in the analysis. Absence of measures to control for confounding in unpaired and nonrandomized studies generally warrants rating down by two levels in the absence of other risk of bias issues.

#### 4.1.1 Example for risk of bias

As an example, we look at the systematic review comparing the test accuracy of the HPV test and VIA for screening for CIN 2-3 lesions [Mustafa et al., Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy]. Background information for this review is provided in Appendix C. More examples, along with their certainty ratings, can be found in Appendices D and E.
We judged all five fully paired comparative test accuracy studies of HPV vs. VIA to be at low risk of bias with QUADAS-C. Although one study demonstrated limitations in the ‘Flow and Timing’ domain (differential loss to follow-up) and for one study there was unclear risk of bias in two domains, results of sensitivity analyses did not suggest a substantial difference in differential accuracy that could be attributed to a single item (overall judgment: no serious risk of bias; Fig. 2; Table 2).
Table 2. Evidence profile for the HPV test versus VIA review [Mustafa et al., Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy]

| Outcome | No. of studies | Study design | Risk of bias | Indirectness | Inconsistency | Imprecision | Publication bias | HPVᵃ (95% CI) | VIAᵃ (95% CI) | Differenceᵃ (95% CI) | Certainty |
|---|---|---|---|---|---|---|---|---|---|---|---|
| True positives | 5 (406 women) | Fully paired studies | Not serious | Not serious | Seriousᵇ | Seriousᶜ | Undetected | 19 (17 to 20) | 14 (10 to 17) | 5 more (1 to 9 more) | ⊕⊕◯◯ Low |
| False negatives | 5 (406 women) | Fully paired studies | Not serious | Not serious | Seriousᵇ | Seriousᶜ | Undetected | 1 (0 to 3) | 6 (3 to 10) | 5 fewer (9 to 1 fewer) | ⊕⊕◯◯ Low |
| True negatives | 5 (9113 women) | Fully paired studies | Not serious | Not serious | Seriousᵇ | Not seriousᵈ | Undetected | 819 (716 to 887) | 852 (765 to 908) | 34 fewer (143 fewer to 76 more) | ⊕⊕⊕◯ Moderate |
| False positives | 5 (9113 women) | Fully paired studies | Not serious | Not serious | Seriousᵇ | Not seriousᵈ | Undetected | 161 (93 to 264) | 128 (72 to 215) | 34 more (76 fewer to 143 more) | ⊕⊕⊕◯ Moderate |

A fully contextualized threshold was used for rating certainty of evidence (see Appendix C for explanation). Numbers may not add up perfectly because of rounding.
a Number of results per 1000 women, in asymptomatic women with 2% prevalence of CIN grade 2-3
b Unexplained heterogeneity in effects ranging from one side of the threshold to the other
c Confidence intervals of summary estimates ranged from recommending VIA to recommending HPV
d Although confidence intervals of summary estimates ranged from recommending VIA to recommending HPV, this was considered to be due to inconsistency. To avoid double counting, the certainty of evidence was not rated down for imprecision.
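The per-1000 natural frequencies in an evidence profile such as Table 2 follow directly from prevalence, sensitivity, and specificity. A minimal sketch, using illustrative accuracy values rather than the review's summary estimates (the helper name is hypothetical):

```python
# Sketch: convert sensitivity/specificity into natural frequencies per 1000
# people tested, as presented in GRADE evidence profiles. The accuracy values
# below are illustrative placeholders, not the HPV/VIA summary estimates.

def per_1000(sensitivity: float, specificity: float, prevalence: float,
             n: int = 1000) -> dict:
    diseased = n * prevalence
    non_diseased = n - diseased
    tp = diseased * sensitivity
    fn = diseased - tp
    tn = non_diseased * specificity
    fp = non_diseased - tn
    return {"TP": round(tp), "FN": round(fn), "TN": round(tn), "FP": round(fp)}

# At 2% prevalence (20 diseased per 1000 women):
test_a = per_1000(sensitivity=0.95, specificity=0.84, prevalence=0.02)
test_b = per_1000(sensitivity=0.70, specificity=0.87, prevalence=0.02)
print(test_a)  # {'TP': 19, 'FN': 1, 'TN': 823, 'FP': 157}
print(test_b)
# Absolute difference in true positives per 1000, test A vs. test B:
print(test_a["TP"] - test_b["TP"])
```

Presenting comparative results this way keeps the link between a difference in accuracy and its downstream consequences (missed cases, unnecessary referrals) explicit.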

### 4.2 Indirectness

General principles for assessing indirectness in test accuracy studies have been described in previous GRADE articles [Schünemann et al., GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies; Schünemann et al., GRADE guidelines: 21 part 1; Brozek et al., Grading quality of evidence and strength of recommendations in clinical practice guidelines: part 2 of 3]. The assessment of indirectness for comparative test accuracy remains largely the same. The population, intervention test (index test), comparison test (alternative index test), and outcomes in the body of evidence should be similar to those of the healthcare question at hand.
One can rate down for population indirectness if there is sufficient concern that the absolute or relative difference in test accuracy will differ substantially in the studied population compared to the population of interest. For the HPV vs. VIA review, this could have been the case if most of the studies had enrolled women in high-income countries, rather than women in low- or middle-income countries. Likewise, if the index test and/or the alternative index test substantially differs from the healthcare question in terms of type, conduct, or interpretation, one can rate down for indirectness. This might have been the case if the HPV tests were in-house tests, rather than the manufactured and standardized tests that are usual in clinical practice. There can be indirectness in the outcome if the target condition (detected by the reference standard) differs from the target condition specified by the healthcare question. For example, for the HPV vs. VIA review, there would be indirectness if the target condition included CIN 1 lesions, alongside CIN 2 and CIN 3 (Box 1).
### Box 1. Example comparative test accuracy question for this guidance paper

In women at risk for cervical intraepithelial neoplasia (CIN), how does the accuracy of the human papillomavirus (HPV) test compare with that of visual inspection with acetic acid (VIA) for the diagnosis of CIN grade 2-3 lesions, verified by colposcopy with or without biopsy? [Mustafa et al., Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy]
• Population: Non-pregnant women aged 18 or older, not previously diagnosed with or treated for CIN, in low- and middle-income countries in a screening setting
• Intervention test: HPV test (role is to replace VIA)
• Comparison test: VIA
• Outcome: Test accuracy for CIN 2-3 lesions, detected by the reference standard colposcopy with or without biopsy

#### 4.2.1 Example for indirectness

In our example, all five studies were done in low- and middle-income countries, consistent with our healthcare question (Box 1). However, the participants in two studies were symptomatic women presenting to the clinic, rather than women invited for screening. While this raised concerns about indirectness, we were uncertain whether this difference in setting would impact the difference in test accuracy between HPV and VIA. There were no additional concerns regarding indirectness, as HPV testing and VIA were performed as in routine practice and the target condition in all studies was CIN 2-3 lesions. Thus, we did not downgrade for indirectness (overall judgment: no serious indirectness; Table 2).

### 4.3 Inconsistency

Previous GRADE guidance explained that the rating of inconsistency is based on the similarity of point estimates, the extent of overlap of confidence intervals (CIs), and statistical criteria for quantifying the extent of unexplained heterogeneity [Schünemann et al., GRADE guidelines: 21 part 2]. If the inconsistency is such that study-specific estimates lie on either side of a predefined threshold, this warrants rating down for inconsistency.
When rating inconsistency and imprecision for comparative test accuracy, the study results and the threshold should preferably be expressed as comparative test accuracy measures, rather than as the accuracy of each index test (Box 2). We focus on the scenario in which a specific cut-off value is used for each test, so that test accuracy can be expressed as sensitivity and specificity. Comparisons of sensitivity and specificity can be expressed as absolute differences, ratios, or odds ratios [Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction]. In this guidance paper, we used absolute differences in our examples because of their simplicity of interpretation, especially when presented as natural frequencies. However, ratios and odds ratios are valid alternatives. More methodological work is needed to evaluate whether absolute measures are consistent across studies and subgroups.

### Box 2. Comparative measures of test accuracy

• A variety of measures can be used to express and compare test accuracy. When studies report a specific cut-off value for test positivity, accuracy can be expressed in terms of sensitivity and specificity. Related comparative test accuracy measures then include: the absolute difference in sensitivity and specificity, the relative sensitivity and specificity, the odds ratio of a (true) positive in the diseased (based on the sensitivities), and the odds ratio of a (true) negative result in the non-diseased (based on the specificities). Such measures are only valid for specific cut-off values for each test; they cannot be generalized to other cut-off values.
• If studies report multiple cut-off values for a test, it may be more challenging to compare and select a test. One method is based on the area under the receiver operating characteristic curve (AUC). If the receiver operating characteristic curves of the tests do not cross, the test with the higher AUC is to be preferred. A disadvantage of using a global measure of test accuracy, such as the AUC, is that the absolute effects of using a test on downstream health outcomes cannot easily be quantified. After first selecting the preferred test, a specific cut-off value can be chosen or result-specific likelihood ratios [Furukawa et al., Chapter 18: Diagnostic tests] can be derived for using the test in practice. In meta-analysis, test accuracy at specified cut-off values can be identified using recently developed models [Steinhauser et al., Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies; Hoyer et al., Meta-analysis of full ROC curves using bivariate time-to-event models for interval-censored data; Jones et al., Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis].

$$\text{Difference in sensitivity} = \text{Sensitivity}_A - \text{Sensitivity}_B$$
$$\text{Difference in specificity} = \text{Specificity}_A - \text{Specificity}_B$$
$$\text{Relative sensitivity} = \text{Sensitivity}_A / \text{Sensitivity}_B$$
$$\text{Relative specificity} = \text{Specificity}_A / \text{Specificity}_B$$
$$\text{Odds ratio of sensitivity} = \frac{\text{Sensitivity}_A / (1 - \text{Sensitivity}_A)}{\text{Sensitivity}_B / (1 - \text{Sensitivity}_B)}$$
$$\text{Odds ratio of specificity} = \frac{\text{Specificity}_A / (1 - \text{Specificity}_A)}{\text{Specificity}_B / (1 - \text{Specificity}_B)}$$
It should be noted that rating inconsistency and imprecision is more challenging with ratios or odds ratios of sensitivity and specificity, because thresholds for ratios and odds ratios are more difficult to interpret. However, if the selected threshold is the null (i.e., the ratio or odds ratio equals 1), rating inconsistency and imprecision is straightforward, regardless of the choice of accuracy measure. The null can be a reasonable threshold for guidelines if one index test has both superior sensitivity and specificity compared to the alternative test, and there are no disadvantages in using that test in terms of direct harms, cost, feasibility, and other considerations relative to the comparison test [Hultcrantz et al., Defining ranges for certainty ratings of diagnostic accuracy; Schünemann et al., GRADE Guidelines: 22; Mustafa et al., Decision making about healthcare-related tests and diagnostic test strategies, paper 5].
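The comparative measures defined in Box 2 can be computed directly from the sensitivity and specificity of two index tests. A minimal sketch with placeholder accuracy values (the function name is ours, not from the guidance):

```python
# Sketch of the comparative accuracy measures listed in Box 2, computed from
# the sensitivity and specificity of two index tests A and B.
# The accuracy values used below are illustrative placeholders.

def comparative_measures(sens_a, spec_a, sens_b, spec_b):
    odds = lambda p: p / (1 - p)
    return {
        "difference_in_sensitivity": sens_a - sens_b,
        "difference_in_specificity": spec_a - spec_b,
        "relative_sensitivity": sens_a / sens_b,
        "relative_specificity": spec_a / spec_b,
        "odds_ratio_sensitivity": odds(sens_a) / odds(sens_b),
        "odds_ratio_specificity": odds(spec_a) / odds(spec_b),
    }

m = comparative_measures(sens_a=0.90, spec_a=0.80, sens_b=0.75, spec_b=0.85)
print(f"Difference in sensitivity = {m['difference_in_sensitivity']:.2f}")  # 0.15
print(f"Relative sensitivity = {m['relative_sensitivity']:.2f}")            # 1.20
print(f"OR of sensitivity = {m['odds_ratio_sensitivity']:.2f}")             # 3.00
```

As the printed values show, the same underlying data can look quite different depending on the chosen measure, which is why the threshold must be expressed on the same scale as the summary estimate.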

#### 4.3.1 Example for inconsistency

In our HPV vs. VIA example, the impact on downstream health outcomes of using HPV instead of VIA was modelled. We set the threshold as a ratio between sensitivity and specificity at which an increase of 1% in sensitivity makes a decrease of 0.9% in specificity acceptable, using inferred values of how women eligible for screening might value the direct and downstream outcomes. Although we express the threshold as a ratio between sensitivity and specificity, it is based on natural frequencies of false positives (FP) and false negatives (FN) at a prevalence of 2%. Details of how this threshold was set are described in Appendix C.
For the outcome true positive (TP) and FN (sensitivity), we rated down by one level because differences in sensitivity were heterogeneous with non-overlapping CIs, and the comparative test accuracy estimates ranged from one side of the threshold to the other (overall judgment: serious inconsistency; Fig. 2; Table 2). For the outcome true negative (TN) and FP (specificity), we also rated down by one level for the same reasons.

### 4.4 Imprecision

Imprecise estimates, identifiable as wide CIs around the summary estimate, can reduce the certainty of evidence [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 2. Inconsistency, Imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables.
]. How wide the interval should be to lower our certainty depends on the threshold used [
• Hultcrantz M.
• Mustafa R.A.
• Leeflang M.M.G.
Defining ranges for certainty ratings of diagnostic accuracy: a GRADE concept paper.
,
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE Guidelines: 22. The GRADE approach for tests and strategies - from test accuracy to patient important outcomes and recommendations.
]. If the CI crosses the threshold, this warrants rating down for imprecision. However, if the wide CI for the summary estimate is clearly due to between-study heterogeneity, one may choose to rate down for inconsistency instead of imprecision.
As with inconsistency, study results should preferably be expressed as comparative test accuracy measures when rating imprecision. However, systematic reviews of comparative test accuracy often present summary statistics separately for each index test. This presents a challenge for rating imprecision, as it is unclear how the threshold should be set for each individual test. Therefore, authors of evidence syntheses should provide comparative test accuracy measures as summary statistics whenever possible.
The GRADE working group has previously suggested the optimal information size (required sample size for a single adequately powered study) as a criterion for rating imprecision, in addition to CIs [
• Guyatt G.H.
• Oxman A.D.
• Kunz R.
GRADE guidelines: 6. Rating the quality of evidence - Imprecision.
]. The optimal information size is intended to address skepticism regarding findings of small studies that suggest large effects with narrow CIs [
• Guyatt G.H.
• Oxman A.D.
• Kunz R.
GRADE guidelines: 6. Rating the quality of evidence - Imprecision.
]. It is as yet unclear whether such skepticism should also apply to comparative test accuracy studies. Whether, when, and how raters should consider the optimal information size when rating imprecision for comparative test accuracy evidence therefore remains an open question and a subject for further methodological research.

#### 4.4.1 Example for imprecision

Using the threshold described in section 4.3.1, we decided that, to recommend HPV over VIA, there should be at least a 17% absolute gain in sensitivity and no more than a 10% loss in specificity. The CI of the difference in mean sensitivity runs from 11% to 41%, leading us to rate down by one level for TP and FN (overall judgment: serious imprecision; Fig. 2; Table 2). While the CI for the difference in mean specificity (-15% to 8%) was wide enough to cross the threshold, this could be explained by inconsistency (note that the number of participants for specificity was large, n = 9,113). Therefore, we did not rate down for TN and FP (overall judgment: no serious imprecision).
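The basic imprecision check applied in this example can be expressed as a simple rule (a sketch; the function name and encoding are ours): rating down is triggered when the CI crosses the decision threshold, unless the width is attributable to inconsistency.

```python
def crosses_threshold(ci_low, ci_high, threshold):
    """True if a confidence interval spans the decision threshold,
    a trigger to consider rating down for imprecision (unless the
    wide CI is clearly explained by between-study heterogeneity)."""
    return ci_low < threshold < ci_high

# Difference in sensitivity: threshold +17%, CI 11% to 41%
crosses_threshold(0.11, 0.41, 0.17)    # crosses: consider rating down
# Difference in specificity: threshold -10%, CI -15% to 8%
crosses_threshold(-0.15, 0.08, -0.10)  # crosses, but attributed to inconsistency
```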

### 4.5 Publication bias

There is little empirical research regarding publication bias in comparative test accuracy studies. Mechanisms that drive selective non-publication in studies of comparative effectiveness could also apply to studies of comparative test accuracy. For instance, developers of new tests may be reluctant to publish non-significant findings when testing the superiority of their test over an existing one. Until further research on publication bias in comparative test accuracy is available, raters should follow the prior GRADE guidance for judging publication bias [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 2. Inconsistency, Imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables.
]. Relevant considerations include for-profit interest, knowledge of unpublished studies, and the presence of small studies with results suggesting an implausibly large difference in accuracy. In addition, tests for small-study effects in test accuracy reviews may also be used, but these have not been validated for comparative test accuracy studies.

#### 4.5.1 Example for publication bias

We did not detect evidence of publication bias in our HPV vs. VIA example. There were no important concerns regarding for-profit interests in any of the included studies, and there were too few studies to construct a funnel plot and test for small-study effects (overall judgment: undetected; Table 2).

### 4.6 Reasons for upgrading the certainty of evidence

The GRADE working group previously identified three primary reasons for increasing the certainty in a body of evidence: (1) large magnitude of effect, (2) all plausible confounders or other biases increase our confidence, and (3) dose-response gradient [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 2. Inconsistency, Imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables.
]. The application of these principles to comparative test accuracy evidence requires further investigation; here we offer temporary and limited guidance.
If the magnitude of the difference in either sensitivity or specificity is very large, such that the possible range of summary estimates lies entirely beyond the specified threshold, raters may have greater confidence in the results, which could mitigate concerns in another GRADE domain. For example, a precise and consistent ≥10% improvement in sensitivity and specificity of HPV over VIA, after considering all GRADE EtD criteria, might have been sufficient to alleviate residual concerns about some risk of bias or indirectness in favor of such a recommendation. However, as in studies of intervention effects, raters should exercise caution when rating up in the presence of study design features with potentially large biasing effects. For example, if one index test is part of the reference standard while the competing index test is not, large differences between these tests can be expected due to bias, and rating up may be inappropriate.
Confounding or bias with a clearly predictable direction could increase certainty. For instance, a new test may have been compared to an existing test under unfavorable circumstances – which would have led to underestimation of the new test's sensitivity – but nevertheless demonstrate superior sensitivity relative to the existing test.
Whether and how dose-response gradients apply to comparative test accuracy questions is still unclear and needs to be explored further.

## 5. Considerations for between-study (indirect) comparisons

So far, we have described the approach of assessing certainty when bodies of evidence consist of comparative test accuracy studies. We speak of between-study comparisons if a comparison is made across studies, with each study evaluating only one of the tests being compared. The approach to between-study comparisons, as described in GRADE guidance 21 [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 1. Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy.
], largely remains the same, but there are additional considerations for the domains risk of bias, indirectness, and inconsistency. See Appendix E for an example of a body of evidence consisting of between-study comparisons.

### 5.1 Risk of bias and indirectness in between-study comparisons

Ideally, a risk of bias tool specifically designed to identify biases in between-study comparisons should be used, but no such tool is currently available; QUADAS-C, for example, is designed for comparative test accuracy studies. We distinguish two types of bias-related issues in between-study comparisons: (1) issues that arise within a study due to flaws in study design and conduct, and (2) issues that arise because of flawed comparisons between different studies at the level of the evidence synthesis. Although both issues relate to bias, GRADE considers the former a matter of risk of bias and the latter an issue of indirectness.
When assessing risk of bias in studies that evaluate the accuracy of a single test, appropriate tools should be used, such as QUADAS-2 [
• Whiting P.F.
• Rutjes A.W.S.
• Westwood M.E.
QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies.
]. This will result in a risk of bias judgment for each test in the comparison. We suggest choosing the highest (i.e., the worst) risk of bias judgment to represent the risk of bias for the comparison.
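The suggestion to carry forward the worst per-test judgment can be sketched as follows (the ordering of QUADAS-2-style judgments here is our own assumption, treating "unclear" as intermediate):

```python
# Per-test risk of bias judgments, ordered from best to worst (our assumption)
ROB_ORDER = ["low", "unclear", "high"]

def comparison_rob(rob_test_a, rob_test_b):
    """Represent the risk of bias for a between-study comparison
    by the highest (i.e., worst) of the per-test judgments."""
    return max(rob_test_a, rob_test_b, key=ROB_ORDER.index)
```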
Indirectness in the context of between-study comparisons refers to obtaining accuracy estimates for each test from different bodies of evidence and comparing them: a set of studies for one test and a fully separate set of studies for the second test, for example. While risk of bias can be addressed separately for the different bodies of evidence, we have an added concern regarding the indirect comparison (similar to indirect comparisons in interventions) with possible confounding due to differences in study group characteristics or in the reference standard. For this reason, between-study comparisons will typically lead to ratings lower than high certainty of the evidence. The indirectness tool [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 1. Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy.
] includes a rating for indirect comparisons, which typically leads to rating down by one or two levels. Rating down by one level would be appropriate if, in the raters’ judgment, the studies evaluating index test A and the studies evaluating index test B are sufficiently similar with regard to factors that affect test accuracy. An example is when participants were sampled from the same region and setting, and similar eligibility criteria and reference standards were used. However, we expect rating down by two levels to be the more common scenario, as test accuracy studies in a systematic review are often very heterogeneous, even among studies that evaluate the same index test.

### 5.2 Inconsistency in between-study comparisons

In between-study comparisons, we are not interested in the inconsistency of the accuracy estimates for each individual test, but rather in the inconsistency of their comparison. However, each study in a between-study comparison estimates the accuracy of only a single test, so inconsistency in comparative test accuracy estimates cannot be directly observed. Through a thought experiment – treating the tests as if they had been compared within the same study – we inferred how inconsistency for each test would influence the inconsistency of the comparison (Fig. 3). If inconsistency is observed for one or both tests, there is a greater probability that the comparison will be inconsistent (this cannot be directly observed, only inferred), and raters should consider rating down for this domain. We suggest a two-step procedure for assessing inconsistency in between-study comparisons: first, rate inconsistency for each index test using established criteria [
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 2. Inconsistency, Imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables.
] and second, infer the inconsistency of their comparison. See the example in Appendix E, where we rated down for inconsistency. How the rating of inconsistency for each index test could be operationalized using thresholds is still a work in progress.

## 6. Bodies of evidence consisting of both comparative test accuracy studies and between-study comparisons

Systematic reviews addressing comparative test accuracy questions frequently include both comparative test accuracy studies (henceforth comparative studies) and between-study comparisons [
• Takwoingi Y.
• Partlett C.
• Riley R.D.
• Hyde C.
• Deeks J.J.
Methods and reporting of systematic reviews of comparative accuracy were deficient: a methodological survey and proposed guidance.
,
• Yang B.
• Vali Y.
Risk of bias assessment of test comparisons was uncommon in comparative accuracy systematic reviews: an overview of reviews.
]. How to best combine these types of evidence is a challenge that parallels the issue of integrating both randomized and nonrandomized studies for evidence synthesis of the effectiveness of interventions [
• Cuello-Garcia C.A.
• Morgan R.L.
• Brozek J.
A scoping review and survey provides the rationale, perceptions, and preferences for the integration of randomized and nonrandomized studies in evidence syntheses and GRADE assessments.
]. While additional methodological work is needed, we provide some preliminary recommendations below.
First, we recommend that raters assess the certainty of evidence from comparative studies and between-study comparisons separately. Second, if comparative studies constitute high certainty evidence, there is typically no need to look for between-study comparisons. Third, if the evidence from comparative studies is of moderate certainty or lower, we suggest assessing the certainty of between-study comparisons and choosing the highest certainty evidence to inform recommendations (although we expect between-study comparisons to provide moderate certainty evidence at best, owing to issues of indirectness). While the practice of combining comparative studies and between-study comparisons in a qualitative or quantitative synthesis is common [
• Yang B.
• Vali Y.
Risk of bias assessment of test comparisons was uncommon in comparative accuracy systematic reviews: an overview of reviews.
], we believe that whether, when, and how these types of evidence can be integrated requires further exploration; such integration should probably be restricted to studies that raise no or only minor concerns on the indirectness domain.
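The three preliminary recommendations above amount to a simple selection rule, sketched here (the function name and the encoding of certainty levels are our own):

```python
# GRADE certainty levels ordered from lowest to highest
LEVELS = ["very low", "low", "moderate", "high"]

def certainty_to_use(comparative, between_study):
    """Preliminary rule from the text: rate both bodies of evidence
    separately; if comparative studies yield high certainty, use them;
    otherwise use whichever body has the higher certainty."""
    if comparative == "high":
        return comparative, "comparative studies"
    if LEVELS.index(between_study) > LEVELS.index(comparative):
        return between_study, "between-study comparisons"
    return comparative, "comparative studies"
```

For example, low certainty comparative evidence alongside moderate certainty between-study evidence would lead raters to base recommendations on the between-study comparisons.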

## 7. Conclusion

The existing GRADE guidance for assessing certainty of evidence from test accuracy studies can be extended to evidence regarding comparisons of two or more competing index tests. The methodology for conducting comparative test accuracy systematic reviews and HTA is an active area of research, including that of network meta-analyses of test accuracy, which will require further GRADE guidance.

## Acknowledgements

We are grateful to Dr. Robby Nieuwlaat (McMaster University), Dr. Carlos Cuello-Garcia (McMaster University), and Mr. Anthony Bozzo (McMaster University) for their valuable input in the discussions regarding various sections of this article.

## References

• Schünemann H.J.
• Oxman A.D.
• Brozek J.
GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies.
BMJ. 2008; 336: 1106-1110https://doi.org/10.1136/bmj.a139
• Schünemann H.J.
• Mustafa R.
• Brozek J.
GRADE Guidelines: 16. GRADE evidence to decision frameworks for tests in clinical practice and public health.
J Clin Epidemiol. 2016; 76: 89-98https://doi.org/10.1016/j.jclinepi.2016.01.032
• Hultcrantz M.
• Mustafa R.A.
• Leeflang M.M.G.
Defining ranges for certainty ratings of diagnostic accuracy: a GRADE concept paper.
J Clin Epidemiol. 2020; 117: 138-148https://doi.org/10.1016/j.jclinepi.2019.05.002
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE Guidelines: 22. The GRADE approach for tests and strategies - from test accuracy to patient important outcomes and recommendations.
J Clin Epidemiol. 2019; 111: 69-82https://doi.org/10.1016/j.jclinepi.2019.02.003
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 1. Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy.
J Clin Epidemiol. 2020; 122: 129-141https://doi.org/10.1016/j.jclinepi.2019.12.020
• Schünemann H.J.
• Mustafa R.A.
• Brozek J.
GRADE guidelines: 21 part 2. Inconsistency, Imprecision, publication bias and other domains for rating the certainty of evidence for test accuracy and presenting it in evidence profiles and summary of findings tables.
J Clin Epidemiol. 2020; 122: 142-152https://doi.org/10.1016/j.jclinepi.2019.12.021
• Mustafa R.A.
• Santesso N.
• Khatib R.
Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy.
Int J Gynecol Obstet. 2016; 132: 259-265https://doi.org/10.1016/j.ijgo.2015.07.024
• World Health Organization
WHO Meeting Report of a Technical Expert Consultation: Non-Inferiority Analysis of Xpert MTB/RIF Ultra Compared to Xpert MTB/RIF. Geneva; 2017. https://www.who.int/tb/publications/2017/XpertUltra/en/
• Takwoingi Y.
• Leeflang M.M.G.
• Deeks J.J.
Empirical evidence of the importance of comparative studies of diagnostic test accuracy.
Ann Intern Med. 2013; 158: 544https://doi.org/10.7326/0003-4819-158-7-201304020-00006
• Leeflang M.
• Treanor L.
Comparative reviews of diagnostic test accuracy in imaging research: evaluation of current practices.
• Schünemann H.J.
Interpreting GRADE's levels of certainty or quality of the evidence: GRADE for statisticians, considering review information size or less emphasis on imprecision?.
J Clin Epidemiol. 2016; 75: 6-15https://doi.org/10.1016/j.jclinepi.2016.03.018
• Tikkinen K.A.O.
• Craigie S.
• Schünemann H.J.
• Guyatt G.H.
Certainty ranges facilitated explicit and transparent judgments regarding evidence credibility.
J Clin Epidemiol. 2018; 104: 46-51https://doi.org/10.1016/j.jclinepi.2018.08.014
• Bossuyt P.M.
• Irwig L.
• Craig J.
• Glasziou P.
Comparative accuracy: assessing new tests against existing diagnostic pathways.
BMJ. 2006; 332: 1089-1092https://doi.org/10.1136/bmj.332.7549.1089
• Whiting P.F.
• Rutjes A.W.S.
• Westwood M.E.
QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies.
Ann Intern Med. 2011; 155: 529https://doi.org/10.7326/0003-4819-155-8-201110180-00009
• Yang B.
• Mallett S.
• Takwoingi Y.
Development of QUADAS-C, a risk of bias tool for comparative diagnostic accuracy studies.
https://doi.org/10.17605/OSF.IO/HQ8MF

• Brozek J.L.
• Akl E.A.
• Jaeschke R.
Grading quality of evidence and strength of recommendations in clinical practice guidelines: Part 2 of 3. the GRADE approach to grading quality of evidence about diagnostic tests and strategies.
Allergy Eur J Allergy Clin Immunol. 2009; 64: 1109-1116https://doi.org/10.1111/j.1398-9995.2009.02083.x
• Pepe M.S.
The Statistical Evaluation of Medical Tests for Classification and Prediction.
Oxford University Press, 2003
• Furukawa T.
• Straus S.
• Bucher H.
• Agoritsas T.
• Guyatt G.
Chapter 18: Diagnostic tests.
in: Guyatt G. Rennie D. Meade M.O. Cook D.J. Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. American Medical Association, 2014: 345-357 (3rd ed.)
• Steinhauser S.
• Schumacher M.
• Rücker G.
Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies.
BMC Med Res Methodol. 2016; 16: 1-15https://doi.org/10.1186/s12874-016-0196-1
• Hoyer A.
• Hirt S.
• Kuss O.
Meta-analysis of full ROC curves using bivariate time-to-event models for interval-censored data.
Res Synth Methods. 2018; 9: 62-72https://doi.org/10.1002/jrsm.1273
• Jones H.E.
• Gatsonsis C.A.
• Trikalinos T.A.
• Welton N.J.
Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis.
Stat Med. 2019; 38: 4789-4803https://doi.org/10.1002/sim.8301
• Mustafa R.A.
• Wiercioch W.
• Ventresca M.
Decision making about healthcare-related tests and diagnostic test strategies. Paper 5: a qualitative study with experts suggests that test accuracy data alone is rarely sufficient for decision making.
J Clin Epidemiol. 2017; 92: 47-57https://doi.org/10.1016/j.jclinepi.2017.09.005
• Guyatt G.H.
• Oxman A.D.
• Kunz R.
GRADE guidelines: 6. Rating the quality of evidence - Imprecision.
J Clin Epidemiol. 2011; : 1283-1293https://doi.org/10.1016/j.zefq.2012.10.016
• Takwoingi Y.
• Partlett C.
• Riley R.D.
• Hyde C.
• Deeks J.J.
Methods and reporting of systematic reviews of comparative accuracy were deficient: a methodological survey and proposed guidance.
J Clin Epidemiol. 2020; 121: 1-14https://doi.org/10.1016/j.jclinepi.2019.12.007
• Yang B.
• Vali Y.
Risk of bias assessment of test comparisons was uncommon in comparative accuracy systematic reviews: an overview of reviews.
J Clin Epidemiol. 2020; 127: 167-174https://doi.org/10.1016/j.jclinepi.2020.08.007
• Cuello-Garcia C.A.
• Morgan R.L.
• Brozek J.
A scoping review and survey provides the rationale, perceptions, and preferences for the integration of randomized and nonrandomized studies in evidence syntheses and GRADE assessments.
J Clin Epidemiol. 2018; 98: 33-40https://doi.org/10.1016/j.jclinepi.2018.01.010