Harms in Systematic Reviews Paper 2: Methods used to assess harms are neglected in systematic reviews of gabapentin

Objective: We compared methods used with current recommendations for synthesizing harms in systematic reviews and meta-analyses (SRMAs) of gabapentin. Study Design & Setting: We followed recommended systematic review practices. We selected reliable SRMAs of gabapentin (i.e., met a pre-defined list of methodological criteria) that assessed at least one harm. We extracted and compared methods in four areas: pre-specification, searching, analysis, and reporting. Whereas our focus in this paper is on the methods used, Part 2 examines the results for harms across reviews. Results: We screened 4320 records and identified 157 SRMAs of gabapentin, 70 of which were reliable. Most reliable reviews (51/70; 73%) reported following a general guideline for SRMA conduct or reporting, but none reported following recommendations specifically for synthesizing harms. Across all domains assessed, review methods were designed to address questions of benefit and rarely included the additional methods that are recommended for evaluating harms. Conclusion: Approaches to assessing harms in SRMAs we examined are tokenistic and unlikely to produce valid summaries of harms to guide decisions. A paradigm shift is needed. At a minimum, reviewers should describe any limitations to their assessment of harms and provide clearer descriptions of methods for synthesizing harms.


Background
Systematic reviews of randomized controlled trials are often considered the pinnacle of the evidence pyramid for answering research questions related to effectiveness. [1] Guidelines recommend that potential harms (Box 1) be assessed alongside potential benefits to avoid one-sided summaries of evidence. [2] A given systematic review might take one of three approaches to assessing harms: pre-specifying all harms of interest, not pre-specifying any harms, or a hybrid approach (Box 1). [3] The choice of approach might depend on the intervention and setting, which can dictate whether an outcome is treated as a potential harm or benefit. For example, weight gain is considered a harm in trials of antipsychotics but might be a benefit in trials of interventions for eating disorders. These approaches have complementary strengths and weaknesses. [3] Meta-research has shown that primary studies and systematic reviews use poor methods to assess harms and report them poorly. [3-21] Guidance on how to synthesize harms is summarized in Box 2, [2,7-9,22] and paper 1 of this series provides an introduction to these issues. [23] Our objectives in this study were to assess whether reviews: used methods described in these guidelines, used appropriate sources of evidence, and applied appropriate methods to synthesize harms. We compared results for harms in a second paper. [24] We selected gabapentin as a case example because it was likely that there would be multiple systematic reviews to compare. Additionally, gabapentin is used widely for a range of conditions, and most prescriptions are off-label. [25]
pain, post-operative pain, fibromyalgia, migraine), psychiatric disorders (bipolar disorder, attention deficit disorder, obsessive compulsive disorder, and post-traumatic stress disorder), restless leg syndrome, and vasomotor symptoms; (iii) have any results for harms (which could have included a general statement that no harms were reported in the included studies); and (iv) be reliable in methods (i.e., met a pre-defined list of methodological criteria). We defined systematic reviews as articles that either (i) self-identified as a systematic review or meta-analysis, or (ii) followed a structured methodology to synthesize research, as per the Institute of Medicine's definition of a systematic review. [7] We focused on reliable reviews, which would be the "best case scenario" for agreement in methods for assessing and reporting harms. We excluded reviews published only as abstracts or for which a full text was unavailable because it is not possible to assess the reliability or methods of a review based solely on an abstract.

Search methods for identification of reviews
We searched PubMed, EMBASE, Epistemonikos, and the Cochrane Database of Systematic Reviews from 1990 to September 17, 2020, without any language restrictions (gabapentin was first approved for use by the United States Food and Drug Administration in 1993). We developed a search strategy with informationists from the Johns Hopkins Welch Medical Library (APPENDIX B).
We used EndNote then Covidence for de-duplication and screening. Two authors (RQ and TR) screened titles and abstracts, and full texts, independently and resolved disagreements through discussion. We piloted the screening with 100 and 50 records at the title/abstract and full-text levels, respectively.

Data extraction
We extracted all details about the methods used in systematic reviews to assess harms, grouped as follows: approach/planning, searching, analysis, and reporting. Specific items extracted for each of these domains can be found in APPENDIX A. We extracted these data only from reviews that we considered to be reliable based on methodologic criteria developed by Cochrane Eyes and Vision United States Satellite (APPENDIX C). [27-32] Both reliability assessment and data extraction were done by two reviewers using single extraction with verification, which is as accurate as independent data extraction but less resource intensive. [33] We used the Systematic Review Data Repository to extract the data from all included reviews. [34] We did not formally pilot test the data extraction form; the structure and many items were taken from previously developed forms, and the reliability assessment has been used many times. [27-32]

Analysis and synthesis
We qualitatively describe and compare the methods used for harms across all studies that met our inclusion criteria. We performed two post hoc subgroup analyses based on (i) have included a search for studies of those harms. By contrast, reviews focused on potential benefits might not include a search for harms at all. Thus, we explored differences in the searches and types of evidence used to address these different types of research questions.
We used Microsoft Excel and Stata 15 to tabulate all results.

Selection of reviews
Fig. 1 depicts the study selection flow diagram. We identified 4320 unique records, reviewed 500 full-text reports, and ultimately included 165 records for 157 reviews. Of the 157 reviews, 70 were considered reliable; the 87 unreliable reviews were excluded from further analysis. Table 1A presents the general characteristics of included reviews. Most reviews (61/70; 87%) were published after the 2008 guidance for assessing harms in systematic reviews published by the Agency for Healthcare Research and Quality (AHRQ), [22] including 36/70 (51%) published between 2016 and 2020. We identified reviews for all the conditions/indications that we anticipated. Most reviews evaluated gabapentin as a single intervention by combining doses across and within studies (61/70; 87%). Eleven (16%) reviews included network meta-analysis (one of which also included pairwise meta-analysis). Of the 60 reviews making direct pairwise comparisons with gabapentin, placebo was the most common comparator, used in 52 (87%) reviews (Table 1A). Nearly three quarters of reviews (51/70, 73%) reported following a guideline for systematic review methods or reporting, commonly PRISMA (26/70, 37%) and Cochrane (23/70, 33%). Seventeen (24%) reviews were not funded and 12/70 (17%) did not report a source of funding. The most commonly reported source of funding was government (24/70, 34%); one review was funded by a pharmaceutical manufacturer (Table 1A).

Methods for assessing harms: approach/planning
No reviews stated they followed any guidance for synthesizing harms (Table 2A).
Most reviews (60/70, 86%) aimed to assess the evidence for both potential benefits and harms, often with a greater focus on benefit, while 10/70 (14%) reviews focused solely on assessing harms. A hybrid approach, wherein at least one pre-specified harm was addressed alongside additional harms identified during the review, was taken by 18 of 70 (26%) reviews. Nearly equal numbers of reviews either did not pre-specify any harms of interest (27/70, 39%) or assessed only pre-specified harms (25/70, 36%) (Table 2A). Among the 28 reviews for which a protocol was available, either through publication or registration (e.g., PROSPERO), harms were mentioned and addressed in 24 (86%) (Table 2A).

Methods for assessing harms: searching
Only 22/70 (31%) reviews performed supplemental searches for unpublished studies or data from regulatory agencies or industry. Table 1B presents characteristics of the searches used by the included reviews. The median [interquartile range] number of databases searched was 4 [3 to 5]. Most reviews had multiple search components, including: searching references, contacting experts in the field, searching for grey literature or conference abstracts, or searching registries for ongoing studies. Almost all reviews included randomized controlled trials (69/70, 99%); 43 (61%) included no other study types.
Few reviews searched for types of studies used only to assess harms (e.g., observational studies of harms) (Table 1B). Our subgroup analyses based on the review purpose (i.e., assessing harms or benefits) and the approach to pre-specification of harms (i.e., whether any harms were pre-specified) found no systematic differences in the search methods or the types of evidence included in reviews. APPENDIX D contains the table of search methods and types of evidence by subgroup.

Methods for assessing harms: analysis
Of the 70 reviews, 44 (63%) performed a quantitative analysis of at least one harm. Eleven (25%) of the 44 conducted a network meta-analysis for a harm or a proxy for harm (Table 2B). Only three reviews specified, in either a protocol or a final report, that their analysis methods differed between harm and benefit outcomes; all others that included a meta-analysis (41/44, 93%) appeared to use the same analysis methods for both. The method used to handle rare events in meta-analysis (i.e., including zero-event cells from studies) can affect the validity of the estimates, [35,36] and we found that more reviews included studies with zero events in one treatment group than included studies with no events in either the gabapentin or comparison group (Table 2B). Most reviews with a quantitative analysis (35/44, 80%) did not report any approach to handling missing data from included studies (e.g., imputing missing standard deviations of estimates) (Table 2B).
Most reviews specified whether they used fixed-effect or random-effects models for analysis. Only three (7%) reviews did not report what type of meta-analysis they conducted for harms; 32/44 (73%) reported at least one random-effects analysis and 15/44 (34%) reported at least one fixed-effect analysis for harms. Although the default in most statistical programs used to conduct meta-analysis (e.g., RevMan, R, Stata) is an inverse-variance model, which is often biased when analyzing rare events, [36-39] we found that the most common meta-analysis model was Mantel-Haenszel (19/44, 43%), with only 9/44 (21%) reviews not specifying what model was used. This suggests that most systematic reviewers know to change the model from the default when analyzing harms.
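To illustrate why the choice of model matters for rare events, the following minimal sketch compares a Mantel-Haenszel pooled odds ratio (no zero-cell correction) with a fixed-effect inverse-variance pooled odds ratio that uses a 0.5 continuity correction. The trial counts, the function names, and the decision to drop double-zero studies from the inverse-variance analysis are assumptions made for illustration; they are not taken from any review in our sample.

```python
import math

# Hypothetical trials: (events_treatment, n_treatment, events_control, n_control).
# These counts are illustrative only and are not taken from any included review.
STUDIES = [
    (1, 100, 0, 100),   # zero events in the control group only
    (2, 150, 1, 150),
    (0, 80, 0, 80),     # zero events in both groups
    (3, 200, 1, 200),
]


def mantel_haenszel_or(studies):
    """Pooled Mantel-Haenszel odds ratio with no zero-cell correction."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in studies:
        b, d = n1 - a, n0 - c          # non-events in each group
        total = n1 + n0
        numerator += a * d / total
        denominator += b * c / total
    if denominator == 0:
        raise ValueError("Mantel-Haenszel odds ratio is undefined for these data")
    return numerator / denominator


def inverse_variance_or(studies, correction=0.5):
    """Fixed-effect inverse-variance pooled odds ratio.

    Assumption for this sketch: `correction` is added to every cell of a
    study with at least one zero cell, and studies with zero events in
    both groups are dropped.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a, n1, c, n0 in studies:
        if a == 0 and c == 0:
            continue                    # double-zero study: excluded
        b, d = n1 - a, n0 - c
        if 0 in (a, b, c, d):
            a, b, c, d = (x + correction for x in (a, b, c, d))
        log_or = math.log((a * d) / (b * c))
        variance = 1 / a + 1 / b + 1 / c + 1 / d
        weighted_sum += log_or / variance
        weight_total += 1 / variance
    if weight_total == 0:
        raise ValueError("No study contributes information")
    return math.exp(weighted_sum / weight_total)


mh_estimate = mantel_haenszel_or(STUDIES)
iv_estimate = inverse_variance_or(STUDIES)
print(f"Mantel-Haenszel OR: {mh_estimate:.2f}")
print(f"Inverse-variance OR (0.5 correction): {iv_estimate:.2f}")
```

With these hypothetical counts the two approaches give noticeably different pooled estimates from the same data, which is why a review should state which model and which zero-cell handling it used.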

Methods for assessing harms: reporting
Harms reporting was often unclear and inconsistent. Most reviews (57/70, 81%) included a statement about harms in the abstract, either about specific harms (31/57, 54%) or a more general statement about the potential for harm (26/57, 46%) (Table 2A). Many reviews (31/70; 44%) did not state whether any selection criteria had been used in the reporting of harms, and 26/70 (37%) reported only pre-specified harms (Table 2A). Selection criteria are the rules defining which harms will be reported; these often involve numerical thresholds (e.g., "harms occurring in at least 2% of participants") (Box 1). Selection criteria were used in 13/70 (19%) reviews, including both vague criteria such as reporting only "the most common harms" and more specific criteria such as reporting "only harms commonly reported in all included studies".
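As an illustration of how a numerical selection criterion shapes what readers see, the following minimal sketch applies a "at least 2% of participants in either group" rule to a hypothetical table of harms. The counts, the harm labelled "rare serious event", and the function name are assumptions for illustration only; real selection criteria vary in threshold and in the group to which they are applied.

```python
# Hypothetical harms table:
# harm -> (events_treatment, n_treatment, events_control, n_control).
HARMS = {
    "dizziness": (30, 200, 10, 200),
    "somnolence": (24, 200, 8, 200),
    "edema": (3, 200, 1, 200),
    "rare serious event": (1, 200, 0, 200),
}


def apply_selection_criterion(harms, threshold=0.02):
    """Split harms into reported and omitted lists.

    A harm is reported only if it occurred in at least `threshold` of
    participants in either group (one of many possible selection rules).
    """
    reported, omitted = [], []
    for name, (a, n1, c, n0) in harms.items():
        if max(a / n1, c / n0) >= threshold:
            reported.append(name)
        else:
            omitted.append(name)
    return reported, omitted


reported, omitted = apply_selection_criterion(HARMS, threshold=0.02)
print("Reported:", reported)
print("Omitted by the threshold:", omitted)
```

In this sketch the uncommon harms disappear from the report entirely, even though a rare harm may be the most serious one, and a reader cannot tell they were observed unless the criterion is stated.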
Most reviews (46/70, 66%) included a statement about limitations specifically for harms. These limitations were commonly about adverse effects being poorly or inconsistently reported among included studies, trial designs (e.g., sample size, duration) being insufficient for identifying harms, limiting included studies to trials, and not searching uncontrolled or unpublished literature.

Discussion
The desire to address harms in systematic reviews has led to tokenism. Guidance indicates that all reviews should assess harms, [2,7,22] but reviews rarely focus on harms, and reviews focusing on benefits rarely use appropriate methods to identify and synthesize evidence about harms. Some limitations in reviews stem from limitations in the included studies that could not be corrected with improved systematic review methods. Other limitations in the reviews stem from the methods chosen by reviewers, including previously identified limitations that have not yet been addressed. [4,6,15,20,21,40-42] Additionally, with a focus on potential benefits, most intervention reviews are limited to randomized controlled trials. [2,6] Randomized controlled trials are rarely designed to address harms, and publicly available evidence about trials is usually insufficient for assessing harms. [2,18,19,41-47] Consequently, systematic reviews of trials may be doomed to draw premature and sometimes invalid conclusions about the balance of benefits and harms. [3,6,22,48-51]
To address these challenges to synthesizing information about harms, a paradigm shift is needed. Cochrane and other producers of systematic reviews should reconsider how best to guide review authors to assess harms in systematic reviews that are designed primarily to assess potential benefits. We believe that systematic reviews specifically focused on harms are needed. Because many drugs are used for multiple indications, systematic reviews of harms that are limited to specific indications will lead to incomplete evidence. When appropriate, consciously separating reviews of benefits and reviews of harms could increase validity, avoid unintended overlap, and reduce overall burden on systematic reviewers.
Appropriate methods to review harms differ from appropriate methods to assess potential benefits. For example, multiple study types should be considered when answering questions about harms, so systematic reviewers should not exclude observational data at the outset of the review process. [3,7,22] While randomized controlled trials can assess harms, especially common harms, most are not designed to do so. Some, but not all, limitations could be overcome in studies with larger samples, more diverse participants, and longer duration. By comparison, non-randomized studies can be misleading because of uncontrolled biases and confounding, so reviewers interested in harms should have training and experience assessing both randomized and non-randomized studies. For example, whereas non-randomized studies of statins showed an increased risk of myalgias, randomized trials and meta-analyses of those trials have shown no increase in risk. [52] Additionally, reviewers should anticipate including unpublished data when published data on harms are incomplete or likely to be inadequate for addressing the review question. [13,14,41-46,51,53] Second, the reporting of analyses of harms requires greater detail in reviews. Reviewers should specify how the analyses were conducted (whether harms were analyzed in the same way as benefits or differently) and key assumptions such as the handling of rare events and missing data. [21,40] Common meta-analysis models for efficacy outcomes, and common assumptions for how to handle rare events and missing data from included studies, are problematic for rare events. [9,35-37,39] Thus, reviews about harms might require more statistical expertise and support compared with reviews about benefits.
Third, and perhaps most importantly, reviewers should discuss the choice of selection criteria for assessing and reporting harms (Box 1). Selection criteria lead to reporting bias in primary studies, and the additional application of selection criteria in systematic reviews further obfuscates the evidence on harms. [17-19,54] Moreover, it is often unclear whether and which selection criteria have been used. Existing guidance for systematic reviews, such as the PRISMA-harms extension, does not address the use of selection criteria. [8] When selection criteria are not reported, users of systematic reviews will not know if reported harms include: (a) all harms identified in all included studies; (b) only harms that were pre-specified; or (c) a subset of all harms selected post hoc. Part 2 of this methodologic study looks more closely at the results for harms of these reviews.
Reviews of harms require resources and information that are unavailable to many academic researchers. In this regard, responsibility for conducting and publishing systematic reviews of harms may fall on government organizations. For pharmaceuticals and devices, regulators such as the US Food and Drug Administration are well-positioned to conduct these reviews properly because they have access to individual participant data from trials, and they have epidemiologists and biostatisticians skilled at analyzing real-world data on harms. Although regulators might be unable to undertake all the research they would like to conduct, publishing the studies they do conduct would be a valuable contribution to policy and practice.
Lastly, reviews should describe the limitations of any harms assessments. Following every recommendation above might be impossible or unnecessary in every review, so if recommended methods are not followed, reviewers should provide a rationale and report any associated limitations so that readers can evaluate their confidence in the results.
Peer reviewers and journal editors should strive to ensure that reporting guidelines are followed and that systematic reviews do not present overly confident summaries of harms when they use unreliable methods.

Conclusions
Consistent with the results of previous research, the current methods of assessing harms in the systematic reviews we examined do not produce valid evidence about harms. The underlying issues we identified demand a paradigm shift rather than incremental change or further guidance. Harms are currently included as tokenistic outcomes in reviews for the sake of assessing the "balance" of benefits and harms. We propose that reviewers fundamentally revise how we approach harms from review conception through resource collection and analysis to reporting.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

What is new?
• Despite the existence of guidelines and recommendations for how to assess harms in systematic reviews, reviews we examined do not appear to adhere to best practices.
• Synthesizing evidence of harms is more challenging and requires substantially more effort than assessing benefits; however, most reviews examined in this study had methods oriented toward answering questions of benefit.
• Selection criteria applied to harms in primary studies affect the completeness of harms data, yet reviews also have selection criteria that further bias their results and limit information available to evidence users.

• The current approach to assessing harms in systematic reviews needs immediate improvement.

Term Definition
Terms related to harms

Harms
"Harms" is a general umbrella term to cover the concept of risk that may be associated with an intervention. "Harms" is used to refer to all related ideas, such as adverse events, side effects, tolerability, or safety.

Systematically collected harms
According to The Final Rule, "'systematic assessment' involves the use of a specific method of ascertaining the presence of an adverse event (e.g., the use of checklists, questionnaires, specific laboratory tests at regular intervals)". Like a potential benefit of treatment, a systematic AE can be defined using five elements: (1) domain, (2) specific measurement, (3) specific metric, (4) method of aggregation, and (5) time-point [13]. For example, "proportion of participants with 50% change from baseline to 8 weeks on the Young Mania Rating Scale total score."

Non-systematically collected harms
According to The Final Rule, "'non-systematic assessment' relies on the spontaneous reporting of adverse events, such as unprompted self-reporting by participants." Non-systematic adverse events may be collected by asking questions like "Have you noticed any symptoms since your last examination?"

Unique harms
A specific harm such as would be reported by someone receiving an intervention, such as "dizziness", "edema", or "somnolence".

General assessment of harm
A non-specific method of assessing harms that aims to summarize multiple aspects of risk into a single measure, such as "occurrence of any harm", "occurrence of serious adverse events", or a composite of several unique harms.

Proxy for harm
A surrogate method of assessing harm that is not a direct representation of harm from an intervention, such as "loss-to-follow-up or drop-out due to harms".

Terms related to review methods

Pre-specification of harms
An approach to assessing harms in systematic reviews whereby reviewers have one or more harms in mind that they consider important and pre-specify as outcomes of interest for their review. These pre-specified harms are the only harms that are assessed in the review.

No pre-specification of harms ("exploratory")
An approach to assessing harms in systematic reviews whereby reviewers do not pre-specify any harms of interest as outcomes for their review. Reviewers assess only harms identified in the review process. A review can specify that they will broadly assess harms as an outcome and still be exploratory if they do not note any specific harms of interest.

Hybrid
An approach to assessing harms in systematic reviews whereby reviewers prespecify at least one harm to assess in the review and also assess harms identified during the review process.

Primary search for evidence
The principal sources of evidence for a search that are recommended as standard for all systematic reviews, including: bibliographic databases, grey literature, study registries, content experts, and reference lists of included studies.

Supplemental search for evidence
The additional sources of evidence for a search that are not standard for systematic reviews and are uncommonly performed, including: searching unpublished data, adverse event reporting systems, and hospital and other databases.

Descriptive assessment of harm
A narrative description of the harm(s) reported in studies included in the review that does not involve meta-analysis of estimates across studies.

Quantitative assessment of harm
The statistical combination of estimates for harm(s) across two or more studies included in the review (i.e., meta-analysis for a harm).

Selection criteria
The specific rules that are used to define a subset of harms that will be reported among all of the harms collected. Selection criteria are often based on a numerical threshold and participant group (e.g., ≥ 5% of participants in the intervention group).

Approach
• Assess multiple harms and focus on specific harms (as opposed to general assessments of harm);

Sources of evidence
• Search multiple sources of evidence to identify information about harms;
• Utilize any unpublished data for included studies that can be obtained;
• For reviews that include harms, do not restrict studies to randomized controlled trials; instead, include observational studies, case reports, adverse event reporting systems;

Methods of analysis
• Absolute measures (e.g., risk difference) might give a better indication of public health impact over relative measures (e.g., risk ratio and odds ratio) when conducting meta-analyses, although these are complementary and should both be considered;
• Avoid inverse-variance and DerSimonian and Laird methods when selecting a model for meta-analysis for rare events, and give preference to Bayesian or select frequentist models (Peto one-step odds ratio or Mantel-Haenszel odds ratio without zero-cell corrections);

Reporting
• Report methods for assessing harms in systematic reviews, including details of how the primary studies assessed harms.
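
The analysis recommendations above can be sketched in a few lines. The following minimal example, which assumes hypothetical trial counts and hypothetical function names, computes an absolute measure (risk difference) and a relative measure (risk ratio) for one trial, and a Peto one-step pooled odds ratio across trials. It is an illustration of the formulas only, not a substitute for a validated meta-analysis package.

```python
import math

# Hypothetical trials: (events_treatment, n_treatment, events_control, n_control).
# These counts are illustrative only.
STUDIES = [(1, 100, 0, 100), (2, 150, 1, 150), (0, 80, 0, 80), (3, 200, 1, 200)]


def risk_difference(a, n1, c, n0):
    """Absolute measure: risk in the treatment group minus risk in control."""
    return a / n1 - c / n0


def risk_ratio(a, n1, c, n0):
    """Relative measure: risk in the treatment group divided by risk in control."""
    if c == 0:
        raise ValueError("Risk ratio is undefined with zero control events")
    return (a / n1) / (c / n0)


def peto_odds_ratio(studies):
    """Peto one-step pooled odds ratio (no zero-cell correction needed).

    For each study, O is the observed number of treatment-group events,
    E its expectation under no effect, and V the hypergeometric variance.
    Studies with no events in either group contribute nothing to either sum.
    """
    sum_o_minus_e = 0.0
    sum_v = 0.0
    for a, n1, c, n0 in studies:
        total = n1 + n0
        events = a + c
        expected = n1 * events / total
        variance = n1 * n0 * events * (total - events) / (total ** 2 * (total - 1))
        sum_o_minus_e += a - expected
        sum_v += variance
    if sum_v == 0:
        raise ValueError("No events in any study")
    return math.exp(sum_o_minus_e / sum_v)


last_trial = STUDIES[-1]
rd = risk_difference(*last_trial)
rr = risk_ratio(*last_trial)
peto = peto_odds_ratio(STUDIES)
print(f"Risk difference: {rd:.3f}  Risk ratio: {rr:.2f}")
print(f"Peto one-step pooled OR: {peto:.2f}")
```

In the single hypothetical trial used here, a large relative effect corresponds to a small absolute difference in risk, which is why the two kinds of measure are best reported together.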
Qureshi et al. Page 14
Characteristics of included reviews of gabapentin (n = 70)