Corresponding author. CLARITY Research Group, Department of Clinical Epidemiology & Biostatistics, Room 2C12, 1200 Main Street West, Hamilton, Ontario L8N 3Z5, Canada. Tel.: +905-525-9140; fax: +905-524-3841.
German Cochrane Center, Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, 79104 Freiburg, Germany; Department of Pediatric and Adolescent Medicine, Division of Pediatric Hematology and Oncology, University Medical Center Freiburg, 79106 Freiburg, Germany
This article is the first of a series providing guidance for use of the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system of rating quality of evidence and grading strength of recommendations in systematic reviews, health technology assessments (HTAs), and clinical practice guidelines addressing alternative management options. The GRADE process begins with asking an explicit question, including specification of all important outcomes. After the evidence is collected and summarized, GRADE provides explicit criteria for rating the quality of evidence that include study design, risk of bias, imprecision, inconsistency, indirectness, and magnitude of effect.
Recommendations are characterized as strong or weak (alternative terms conditional or discretionary) according to the quality of the supporting evidence and the balance between desirable and undesirable consequences of the alternative management options. GRADE suggests summarizing evidence in succinct, transparent, and informative summary of findings tables that show the quality of evidence and the magnitude of relative and absolute effects for each important outcome and/or as evidence profiles that provide, in addition, detailed information about the reason for the quality of evidence rating.
Subsequent articles in this series will address GRADE’s approach to formulating questions, assessing quality of evidence, and developing recommendations.
Grading of Recommendations Assessment, Development, and Evaluation (GRADE) offers a transparent and structured process for developing and presenting summaries of evidence, including its quality, for systematic reviews and recommendations in health care.
GRADE provides guideline developers with a comprehensive and transparent framework for carrying out the steps involved in developing recommendations.
GRADE’s use is appropriate and helpful irrespective of the quality of the evidence: whether high or very low.
Although the GRADE system makes judgments about quality of evidence and strength of recommendations in a systematic and transparent manner, it does not eliminate the inevitable need for judgments.
In this, the first of a series of articles describing the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach to rating quality of evidence and grading strength of recommendations, we will briefly summarize what GRADE is, provide an overview of the GRADE process of developing recommendations, and present the endpoint of the GRADE evidence summary: the evidence profile (EP) and the summary of findings (SoF) table. We will provide our perspective on GRADE’s limitations and present our plan for this series.
2. What is GRADE?
GRADE offers a system for rating quality of evidence in systematic reviews and guidelines and grading strength of recommendations in guidelines. The system is designed for reviews and guidelines that examine alternative management strategies or interventions, which may include no intervention or current best management. In developing GRADE, we have considered a wide range of clinical questions, including diagnosis, screening, prevention, and therapy. Most of the examples in this series are clinical examples. The GRADE system can, however, also be applied to public health and health systems questions.
GRADE is much more than a rating system. It offers a transparent and structured process for developing and presenting evidence summaries for systematic reviews and guidelines in health care and for carrying out the steps involved in developing recommendations. GRADE specifies an approach to framing questions, choosing outcomes of interest and rating their importance, evaluating the evidence, and incorporating evidence with considerations of values and preferences of patients and society to arrive at recommendations. Furthermore, it provides clinicians and patients with a guide to using those recommendations in clinical practice and policy makers with a guide to their use in health policy.
A common definition of guidelines refers to “systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances” [
]. This series will describe GRADE’s comprehensive approach to guideline development and to other similar guidance documents.
The optimal application of the GRADE approach requires systematic reviews of the impact of alternative management approaches on all patient-important outcomes. In the future, as specialty societies (e.g., American College of Physicians), national guideline developers and HTA agencies (e.g., National Institute for Health and Clinical Excellence), publishers (e.g., BMJ), publications (e.g., UpToDate), and international organizations (e.g., World Health Organization, Cochrane Collaboration) pool resources, high-quality evidence summaries will become increasingly available. As a result, even guideline panels with limited resources charged with generating recommendations for local consumption will be able to use GRADE to produce high-quality guidelines [
This series of articles about GRADE is most useful for three groups: authors of systematic reviews, groups conducting HTAs, and guideline developers. GRADE suggests somewhat different approaches for rating the quality of evidence for systematic reviews and for guidelines. HTA practitioners, depending on their mandate, can decide which approach is more suitable for their goals.
The GRADE approach is applicable irrespective of whether the quality of the relevant evidence is high or very low. Thus, all those who contribute to systematic reviews and HTA, or who participate in guideline panels, are likely to find this series informative. Consumers—and critics—of reviews and guidelines who desire an in-depth understanding of the evidence and recommendations they are using will also find the series of interest.
The series will provide a “how to” guide through the process of producing systematic reviews and guidelines, using examples to illustrate the concepts. We will not start with a broad overview of GRADE but rather assume that readers are familiar with the basics. Those who are not familiar may want to begin by reading a brief summary of the approach [
]. Computer programs that facilitate the development of EPs and SoF tables provide a complement to this series.
4. The GRADE process—defining the question and collecting evidence
Figure 1 presents a schematic view of GRADE’s process for developing recommendations, in which unshaded boxes describe steps common to systematic reviews and guidelines and shaded boxes describe steps specific to guidelines. One begins by defining the question in terms of the populations, the alternative management strategies (an intervention, sometimes experimental, and a comparator, sometimes standard care), and all patient-important outcomes (in this case, four) [
]. For guidelines, one classifies those outcomes as either critical (two outcomes in the figure) or important but not critical (two outcomes). A systematic search leads to inclusion of relevant studies (in this schematized presentation, five such studies).
Systematic review or guideline authors then use the data from the individual eligible studies to generate a best estimate of the effect on each patient-important outcome and an index (typically a confidence interval [CI]) of the uncertainty associated with that estimate.
5. The GRADE process—rating evidence quality
In the GRADE approach, randomized controlled trials (RCTs) start as high-quality evidence and observational studies as low-quality evidence supporting estimates of intervention effects. Five factors may lead to rating down the quality of evidence and three factors may lead to rating up (Fig. 2). Ultimately, the quality of evidence for each outcome falls into one of four categories from high to very low.
Systematic review and guideline authors use this approach to rate the quality of evidence for each outcome across studies (i.e., for a body of evidence). This does not mean rating each study as a single unit. Rather, GRADE is “outcome centric”: rating is made for each outcome, and quality may differ—indeed, is likely to differ—from one outcome to another within a single study and across a body of evidence.
For example, in a series of unblinded RCTs measuring both the occurrence of stroke and all-cause mortality, it is possible that stroke—much more vulnerable to biased judgments—will be rated down for risk of bias, whereas all-cause mortality will not. Similarly, a series of studies in which very few patients are lost to follow-up for the outcome of death, and very many for the outcome of quality of life, is likely to result in judgments of lower quality for the latter outcome. Problems with indirectness may lead to rating down quality for one outcome and not another within a study or studies if, for example, fracture rates are measured using a surrogate (e.g., bone mineral density) but side effects are measured directly.
6. The GRADE process—grading recommendations
Guideline developers (but not systematic reviewers) then review all the information to make a final decision about which outcomes are critical and which are important and come to a final decision regarding the rating of overall quality of evidence.
Guideline (but not systematic review) authors then consider the direction and strength of recommendation. The balance between desirable and undesirable outcomes and the application of patients’ values and preferences determine the direction of the recommendation and these factors, along with the quality of the evidence, determine the strength of the recommendation. Both direction and strength may be modified after taking into account the resource use implications of the alternative management strategies.
7. The endpoint of the GRADE process
The endpoint for systematic reviews and for HTAs restricted to evidence reports is a summary of the evidence—the quality rating for each outcome and the estimate of effect. For guideline developers and HTAs that provide advice to policy makers, a summary of the evidence represents a key milestone on the path to a recommendation.
The GRADE working group has developed specific approaches to presenting the quality of the available evidence, the judgments that bear on the quality rating, and the effects of alternative management strategies on the outcomes of interest. We will now summarize these approaches, which we call the GRADE EP and the SoF table. In doing so, we are taking something of a “flashback” approach to this series of articles: we begin by presenting the conclusion of the evidence summary process and will then go back to describe in detail the steps required to arrive at that conclusion.
8. What is the difference between an EP and a SoF table?
An EP (Table 1) includes a detailed quality assessment in addition to a summary of findings. That is, the EP includes an explicit judgment of each factor that determines the quality of evidence for each outcome (Fig. 2), in addition to a summary of findings for each outcome. The SoF table (Table 2) includes an assessment of the quality of evidence for each outcome but not the detailed judgments on which that assessment is based.
Table 1. GRADE evidence profile: antibiotics for children with acute otitis media

Outcome | Control risk(a) | Intervention risk (95% CI)(a) | Relative effect (95% CI)
Pain at 24 h | 367 per 1,000 | 330 per 1,000 (286–382) | RR 0.90 (0.78–1.04)
Pain at 2–7 d | 257 per 1,000 | 185 per 1,000 (159–213) | RR 0.72 (0.62–0.83)
Hearing, inferred from the surrogate outcome abnormal tympanometry—1 mo(b) | — | — | —

Abbreviations: CI, confidence interval; RR, risk ratio; GRADE, Grading of Recommendations Assessment, Development, and Evaluation.
a The basis for the control risk is the median control group risk across studies. The intervention risk (and its 95% CI) is based on the control risk in the comparison group and the relative effect of the intervention (and its 95% CI).
b Because of indirectness of outcome.
c Generally, GRADE rates down for inconsistency in relative effects (which are not inconsistent in this case). Inconsistency here is in absolute effects, which range from 1% to 56%. Contributing factors to the decision to rate down in quality include the likely variation between antibiotics and the fact that most of the adverse events come from a single study. Consideration of indirect evidence from other trials of antibiotics in children (not undertaken) would likely further inform this issue.
d Ideally, evidence from nonotitis trials with similar ages and doses (not obtained) might improve the quality of the evidence.
The EP and the SoF table serve different purposes and are intended for different audiences. The EP provides a record of the judgments made by review or guideline authors. It is intended for review authors, those preparing SoF tables, and anyone who questions a quality assessment. It helps those preparing SoF tables to ensure that the judgments they make are systematic and transparent, and it allows others to inspect those judgments. Guideline panels should use EPs to ensure that they agree about the judgments underlying the quality assessments and to establish the judgments recorded in the SoF tables.
SoF tables are intended for a broader audience, including end users of systematic reviews and guidelines. They provide a concise summary of the key information needed by someone making a decision and, in the context of a guideline, a summary of the key information underlying a recommendation. GRADEpro computer software facilitates the process of developing both EPs and SoF tables [
9. More than one systematic review may be needed for a single recommendation
Figure 1 illustrates that evidence must be summarized—the summaries ideally coming from optimally conducted systematic reviews—for each patient-important outcome. For each comparison of alternative management strategies, all outcomes should be presented together in one EP or SoF table. Not all studies relevant to a health care question will provide evidence regarding every outcome. Figure 1, for example, shows the first study providing evidence for the first and second outcomes, the second study for the first three outcomes, and so on. Indeed, there may be no overlap between studies providing evidence for one outcome and those providing evidence for another. For instance, RCTs may provide the relevant evidence for benefits and observational studies for rare, serious adverse effects.
Because most existing systematic reviews do not adequately address all relevant outcomes (many, for instance, are restricted to RCTs), the GRADE process may require relying on more than one systematic review. Ideally, future systematic reviews will comprehensively summarize evidence on all important outcomes for a relevant question.
10. A single systematic review may need more than one SoF table
Systematic reviews often address more than one comparison. They may evaluate an intervention in two disparate populations or examine the effects of a number of interventions. Such reviews are likely to require more than one SoF table. For example, a review of influenza vaccines may evaluate the effectiveness of vaccination for different populations, such as community-dwelling and institutionalized elderly patients, or for different types of vaccines.
11. An example of an EP
Table 1 presents an example of a GRADE EP addressing the desirable and undesirable consequences of use of antibiotics for children with otitis media living in high- and middle-income countries. The most difficult judgment in this table relates to the quality of evidence regarding adverse effects of antibiotics. In relative terms, the increases in adverse effects were reasonably consistent across trials. The trials, however, had very different rates of adverse effects (from 1% to 56%). Furthermore, from evidence external to the trials, we know that adverse effects differ across drugs (amoxicillin causes more adverse effects than penicillin). In addition, most of the events driving the increase come from a single trial which, of those included, had the highest risk of bias. The investigators recognized that ideally they would generate a summary of adverse effects from nonotitis trials with similar drug doses and patient age. Ultimately, they chose to rate down quality from high (starting high because the evidence comes from randomized trials) to moderate quality on the basis of inconsistency in absolute effects.
This dilemma faced by the investigators in making their rating of quality of evidence for adverse effects highlights two themes that will recur throughout this series. First, for many close-call judgments that are required in evaluating evidence, disagreement between reasonable individuals will be common. GRADE allows the pinpointing of the nature of the disagreement. Decision makers are then in a position to make their own judgments about the relevant issues.
Second, GRADE asks systematic review authors and guideline developers to consider quality of evidence under a number of discrete categories and to either rate down or not on the basis of each category (Fig. 2). Rigid adherence to this approach, however, ignores the fact that quality is actually a continuum and that an accumulation of limitations across categories can ultimately provide the impetus for rating down in quality. Ultimately, GRADE asks authors who decide to rate down quality by a single level to specify the one category most responsible for their decision (in this case, inconsistency of absolute effects) while documenting (as in the previous paragraph and in the footnotes in Table 1, Table 2), all factors that contributed to the final decision to rate down quality.
This presentation and the EP (Table 1) and SoF table (Table 2) illustrate another point: although we suggest standard formats based on pilot testing, user testing, and evaluations [
], alternative formats may be desirable for different audiences. Indeed, the order of the columns and the presentation of the absolute risks differ in the EP and SoF table we present in this article.
In subsequent articles, we will continue to present examples of different formats for these tables. For both EPs and SoF tables, there is a trade-off between consistency, which facilitates their use, and adaptation to address specific audiences or characteristics of the evidence, for example, by leaving out columns for some elements of the quality assessment or presenting the findings in a different way. Furthermore, EPs and SoF tables focusing on continuous variables and those addressing diagnostic questions may require a different format. Finally, the user testing conducted thus far is limited, and further testing may generate differing findings.
We suggest, however, that a few items should be included in all evidence summaries. For example, all EPs should include a row for each patient-important outcome. Typically, each row should include columns for the number of studies and the number of participants, the study design (randomized trials or observational studies), relevant factors that determine the evidence quality (Fig. 2), the overall judgment of quality (high, moderate, low, or very low) for that outcome, and estimates for the relative and absolute effects of the intervention.
12. An example of a SoF table
Table 2 presents a SoF table in the format we recommend on the basis of pilot testing, user testing, and evaluations [
]. The Appendix presents an explanation of the terms found in the SoF table and the EP.
A SoF table presents the same information as the full EP, omitting the details of the quality assessment and adding a column for comments. The columns are ordered by importance: more important information in the earlier columns, less important in the later ones. Aside from the different column order, the SoF table (Table 2) presents the absolute risks in the intervention and control groups with a CI around the intervention group rate, whereas the EP (Table 1) presents the risk difference with an associated CI. In addition, for nonsignificant outcomes (e.g., hearing, inferred from the surrogate outcome tympanometry), the EP notes only that the absolute risk difference is nonsignificant, whereas the SoF table provides a CI around the intervention event rate.
The suggested format for SoF tables represents a compromise between simplicity (to make the information as easily accessible as possible to a wide audience) and completeness (to make the information and the underlying judgments as transparent as possible). When this format is used, judgments must still be made about what information to present (e.g., which outcomes and what levels of risk) and how to present that information (e.g., how to present continuous outcomes). As we have noted, although we encourage the use of this or a similar format and consistency, those preparing SoF tables should consider their target audience and the specific characteristics of the underlying evidence when deciding on the optimal format for a SoF table. Future editions of GRADEpro will include additional options for the preparation of EPs and SoF tables reflecting this flexibility [
13. Modified versions of the GRADE approach
Some organizations have used modified versions of the GRADE approach. We recommend against such modifications: the elements of the GRADE process are interlinked, modifications may confuse some users of evidence summaries and guidelines, and such changes compromise the goal of a single system with which clinicians, policy makers, and patients can become familiar.
14. GRADE’s limitations
Those who want to use GRADE should consider five important limitations of the GRADE system. First, as noted previously, GRADE has been developed to address questions about alternative management strategies, interventions, or policies. It has not been developed for questions about risk or prognosis, although evidence regarding risk or prognosis may be relevant to estimating the magnitude of intervention effects or providing indirect evidence linking surrogate to patient-important outcomes.
Second, attempted application of GRADE to an ill-defined set of recommendations that one may call “motherhood statements” or “good practice recommendations” will prove problematic. A guideline panel may want to issue such recommendations relating to interventions that represent necessary and standard procedures of the clinical encounter or health care system—such as history taking and physical examination, helping patients to make informed decisions, obtaining written consent, or the importance of good communication. Some of these recommendations may not be helpful, and when they are helpful, it may not be a useful exercise to rate the quality of evidence or grade the strength of the recommendations. Other recommendations may be confused with good practice recommendations but may in fact require grading.
Recommendations that are unhelpful include those that are too vague to be implemented (e.g., “take a comprehensive history” or “complete a detailed physical examination”). Some interpretations of such recommendations might lead to inefficient or counterproductive behavior. Guideline panels should issue recommendations only when they are both specific and actionable.
Recommendations that may be helpful but do not need grading are typically those in which it is sufficiently obvious that desirable effects outweigh undesirable effects that no direct evidence is available because no one would be foolish enough to conduct a study addressing the implicit clinical question. Typically, such recommendations are supported by a great deal of indirect evidence, but teasing out the nature of the indirect evidence would be challenging and a waste of time and energy. One way of recognizing such questions is that if one made the alternative explicit, it would be bizarre or laughable.
Procedures may be sufficiently ingrained in standard clinical practice that guideline panels would be inclined to consider them good practice recommendations when in fact a dispassionate consideration would suggest that legitimate doubt remains regarding the balance of desirable and undesirable consequences. Such recommendations should undergo formal rating of quality of evidence and grading of strength of recommendations. Table 3 provides examples of unhelpful good practice recommendations, helpful good practice recommendations, and recommendations that might be confused with good practice recommendations but require rating of quality of evidence and grading of recommendations.
Table 3. Examples of good practice statements and statements that could be confused with motherhood statements

Recommendations that are not helpful:
- In patients presenting with chronic heart failure, take a careful and detailed history and perform a clinical examination. Comment: “Careful and detailed history” is neither specific nor actionable.
- In patients with hypertension, the PE should include an appropriate measurement of BP, with verification in the contralateral arm. Comment: It is not clear what exactly the authors mean by “appropriate measurement of BP.”
- All patients should undergo PE to define the severity of the hospital-acquired pneumonia, to exclude other potential sources of infection, and to reveal specific conditions that can influence the likely etiologic pathogens (level II). Comment: The elements of a PE that are necessary to reveal conditions that can influence the likely pathogens are uncertain.
- In patients presenting with a seizure, a PE (including cardiac, neurological, and mental state) and developmental assessment, where appropriate, should be carried out (LOE: C). Comment: It is unclear what makes the particular aspects of PE or developmental assessment appropriate.
- Health care professionals should facilitate access as soon as possible to assessment/treatment and promote early access throughout all phases of care. Comment: The specific actions required to facilitate access are not specified and thus obscure.

Recommendations that may be helpful but do not need grading:
- In patients presenting with heart failure, initial assessment should be made of the patient’s ability to perform routine/desired activities of daily living (LOE: C). Comment: The alternative (initial assessment excluding ascertainment of ability to perform routine activities) is not credible.
- Pregnant women should be offered evidence-based information and support to enable them to make informed decisions regarding their care, including details of where they will be seen and who will undertake their care (LOE: C). Comment: Most would consider a recommendation to not offer such information a violation of basic standards of care.
- Routinely record the daytime activities of people with schizophrenia in their care plans, including occupational outcomes. Comment: A recommendation to omit recording such activities is not credible.
- When working with caregivers of people with schizophrenia: provide written/verbal information on schizophrenia and its management, including how families/caregivers can help through all phases of treatment. Comment: Although randomized trials of specific educational programs may be warranted, a trial in which the basic information described here is withheld would be unacceptable.

Recommendations that need grading:
- In patients with hypertension, the PE should include auscultation for carotid, abdominal, and femoral bruits. Comment: This recommendation is specific, but may be a waste of time, or lead to positive results that prompt fruitless, resource-consuming investigation.
- In patients with diabetes, monofilaments should not be used to test more than 10 patients in one session and should be left for at least 24 h to “recover” (buckling strength) between sessions (LOE: C). Comment: If there is only very low-quality evidence to support such a recommendation, clinicians should be aware of this, and the recommendation should be weak.
- Monitoring for the development of diabetes in those with prediabetes should be performed every year (LOE: E). Comment: The alternative should be specified (is it more frequently, less frequently, or not at all?). Specifying the alternative would make it evident that formal grading is desirable.
- Perform the A1C test at least two times a year in patients who are meeting treatment goals (and who have stable glycemic control) (LOE: E). Comment: The alternative should be specified (is it more frequently, less frequently, or not at all?). Specifying the alternative would make it evident that formal grading is desirable.
Third, as illustrated in Fig. 3, preparing a guideline entails several steps both before and after those steps to which the GRADE system applies. It is important for review authors and guideline developers to understand where GRADE fits into the overall process and to look elsewhere for guidance related to those other steps [
]. We do, however, in later articles in this series, provide our view of how the GRADE system is best implemented in the context of these other steps.
Fourth, the overwhelming experience with GRADE thus far is in evaluation of preventive and therapeutic interventions and in addressing clinical questions rather than public health and health systems questions. Those applying GRADE to questions about diagnostic tests, to public health, or to health systems questions will face some special challenges [
]. We will address these challenges, particularly those related to diagnostic tests, later in this series. Aware that work remains to be done in refining the GRADE process and addressing areas of uncertainty, the GRADE working group continues to meet regularly and continues to welcome new members to participate in the discussions.
Finally, GRADE will disappoint those who hope for a framework that eliminates disagreements in interpreting evidence and in deciding on the best among alternative courses of action. Although the GRADE system makes judgments about quality of evidence and strength of recommendations in a more systematic and transparent manner, it does not eliminate the need for judgments.
15. Where from here
The next article in this series will describe GRADE’s approach to framing the question that a systematic review or guideline is addressing and deciding on the importance of outcomes. The next set of articles in the series will address in detail the decisions required to generate EPs and SoF tables, such as those presented in Table 1, Table 2. The series will then address special challenges related to diagnostic tests and resource use and the process of going from evidence to recommendations. The series will conclude by commenting on issues of applying GRADE in guideline panels.
The Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system has been developed by the GRADE Working Group. The named authors drafted and revised this article. A complete list of contributors to this series can be found on the Journal of Clinical Epidemiology website.