A Prospective Comparison of Evidence Synthesis Search Strategies Developed With and Without Text-Mining Tools

OBJECTIVE
We compared the process of developing searches with and without using text-mining tools (TMTs) for evidence synthesis products.


STUDY DESIGN
This descriptive comparative analysis included seven systematic reviews, classified as simple or complex. Two librarians created MEDLINE strategies for each review, using either usual practice (UP) or TMTs. For each search we calculated sensitivity, number-needed-to-read (NNR), and time spent developing the search strategy.


RESULTS
We found UP searches were more sensitive (UP 92% [95% CI, 85 to 99]; TMT 84.9% [95% CI, 74.4 to 95.4]) and had a lower NNR (UP 83 [SD 34]; TMT 90 [SD 68]). UP librarians spent an average of 12 hours (SD 8) developing search strategies, compared to 5 hours (SD 2) for TMT librarians.


CONCLUSION
Across all reviews, TMT searches were less sensitive than UP searches, but confidence intervals overlapped. For simple systematic review topics, TMT searches were faster and slightly less sensitive than UP searches. For complex topics, TMT searches were faster and less sensitive than UP searches but identified unique eligible citations not found by the UP searches.


Introduction

Background
Given the explosive growth in biomedical evidence, information retrieval methods research is needed to ensure efficient and effective search processes. Limited investigations to date suggest that the use of text-mining tools (TMTs) in systematic reviews saves production time and improves the quality of search results. [1][2][3][4] An "objective" approach to developing search strategies using TMTs has been adopted and validated by Germany's Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (IQWiG). [5][6][7] Hausner et al. define the objective approach as comprising a set of steps: "generation of a total set (relevant references from systematic review), splitting of the total set into a development set and comparator set, development of the search strategy with references from the development set (analyzing information derived from the titles and abstracts of relevant references with text-mining tools), and validation of the search strategy (checking whether references from the comparator set can be identified with the search strategy developed beforehand)." 6 To ascertain the applicability of text-mining-based search approaches in a real-world setting, across a variety of review topics, and using freely available tools, and specifically to inform the practice of evidence synthesis librarians working for the Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Program, a group of EPC librarians and other methodologists conducted a prospective comparative study of search strategy development with and without TMTs.

Search Strategy Development
In their usual practice (UP), evidence synthesis librarians develop search strategies using a sequential process of tasks to identify search terms, generate a logic structure, and evaluate performance (see Figure 1). Many subprocesses focus on achieving an optimal combination of search keywords and subject heading terms (e.g., Medical Subject Headings [MeSH]) to retrieve all relevant citations on a systematic review topic with as few irrelevant citations as possible. Finding an acceptable balance of search sensitivity/recall to precision/number needed to read (NNR) is time consuming and requires analysis of exploratory search strategies and known relevant citations. 8,9 For this study, the UP method for developing search strategies was evaluated against a TMT method, specifically in respect to three search subprocesses shown in Figure 1: (1) developing model keyword (title and abstract) terms/phrases, (2) developing model subject heading terms, and (3) evaluating model search strategy, including reviewing relevant citations.

Text-Mining Tools
Many software programs analyze textual documents and bibliographic citations. 10,11 Some measure the frequency of term/phrase occurrences in a corpus of text, some suggest subject heading terms based on a set of bibliographic citations or a sample of text, and others generate a visual representation of search results to show relationships (e.g., between authors or related topic areas). TMTs may be custom built to support systematic review production or intended for entirely unrelated research tasks. They may be web-based or require downloaded software, either freely available or through a paid license. Resource costs are also associated with the time needed to learn how to use these tools; while some TMTs are intuitive and simple, others are complex programs that require invested training time.
The variety of available TMTs makes it difficult to decide which tool to use for search strategy development and to determine whether the benefit to the systematic review justifies the time needed to learn the tools. 12,13 This investigation assesses the utility of freely available TMTs for EPC evidence synthesis librarians and the broader international community of systematic review producers.

Objectives
The objectives of this study were to compare the benefits and tradeoffs of searches developed with and without TMTs for evidence synthesis products in real-world settings. Based on prior research [5][6][7][12][13][14] and what we knew about the role of some TMTs in term generation and of others in identifying irrelevant term clusters, we hypothesized that TMTs would increase sensitivity, improve precision (operationalized as NNR [1/precision]), and reduce the time needed to develop search strategies.

Key Questions
The following Key Questions (KQs) guided our investigation and the decision to collect both quantitative and qualitative data to inform our results:
1. Do TMTs decrease the time needed to develop keyword and subject term strategies compared to conventional approaches?
2. Does increased search recall from TMTs:
   a. improve the yield of relevant citations that would not have been found using conventional techniques?
   b. result in an unreasonable number of excess citations for screeners given existing staff resources?
3. Does using TMTs in the draft search strategy evaluation step improve the final search by identifying groups of irrelevant records which can safely be removed from the results (improving precision)?
4. Does the type of review topic (complex versus simple) make a difference in the performance of TMTs, according to the criteria evaluated in Questions 1-3?

Project Identification and Recruitment
The investigators identified systematic reviews with an approximate completion date of February 2020 that were eligible for inclusion in this project. This date was chosen in an attempt to ensure that all projects were completed before analysis. To ensure a variety of review types, eligible reviews were any systematic review or systematic review update funded by AHRQ or conducted by the U.S. Department of Veterans Affairs Evidence Synthesis Program (VA ESP), or any Cochrane group. These groups were chosen because the project librarians overlapped with our research team of six EPC-affiliated librarians, and were known to follow the methods set out in the EPC Methods Guide 8 and Cochrane Handbook 9 for search strategy design. A target of 10 included reviews was set, including one pilot review to test our process. The target was set to achieve as large a sample as possible within the project timeframe (1 year). We estimated the average number of reviews produced by these groups annually and chose a slightly lower number, anticipating recruitment might be difficult. The six EPC-affiliated librarians discussed each review topic and came to a consensus on classifying it as simple or complex to assess whether adding TMTs was more useful for simple or complex topics.
The Scientific Resource Center (SRC), the methods support center for the EPC program, assisted in coordinating between investigators and review teams. Once a potential review was identified, the SRC contacted the review team, confirmed their participation, and then assigned the study to a TMT search librarian. Three review teams did not respond to our request or declined to participate, due to the extra workload. The study design is summarized in Figure 2.
The MEDLINE search strategy was completed by one UP librarian and one TMT librarian. The UP librarian was the librarian on the team running the chosen review; the TMT librarians were randomly chosen from a group of six EPC-affiliated librarians. All UP and TMT librarians had a master's degree in library science and were affiliated at the beginning of the project with an EPC. More details about the librarians are in the results section. Several librarians worked on more than one project, and a librarian could be a UP librarian on one project and a TMT librarian on another. The UP librarian provided the SRC with the search background materials (the review protocol, minutes of relevant meetings, and other similar documents) and a bibliography or list of PubMed IDs (PMIDs) to be used as seed citations for the TMT librarian's use.
The UP and TMT librarians developed their searches simultaneously and independently using the same platform (Ovid or PubMed), and the TMT librarian's search was limited to the same date range as the UP final search. We did not prescribe how TMT librarians were to use the tools, so TMT librarians were free to use UP practices, including determining Boolean structures, ensuring all concepts had both MeSH and free-text terms, and checking the results against known citations. TMT and UP librarians were matched by platform because the different interfaces yield different results; matching thereby reduced potential confounding due to search platform. Librarians remained anonymous throughout the study. All searches were peer reviewed by a second librarian using the Peer Review of Electronic Search Strategies (PRESS) assessment form (http://www.sciencedirect.com/science/article/pii/S0895435616000585) 15 to help ensure that both searches were of high quality.

Selection of Text-Mining Tools
TMT librarians could choose one or more TMTs per category from a preselected list in each of three categories: (1) keyword/phrase tools: AntConc, 16 PubReMiner 17 ; (2) subject term tools: MeSH on Demand, 18 PubReMiner, 17 Yale MeSH Analyzer 19 ; and (3) strategy evaluation tools: Carrot2, 20 VOSviewer. 21 The TMTs and categories were agreed upon by the investigators before the study began and reflect free, open-source, or web-based tools available to most librarians (note that local security firewall issues may preclude use in individual information technology environments) and known to the investigators at the outset of the study. Because research on using text-mining tools in systematic review searching is still exploratory and the evidence for best practices is in its infancy, we did not prescribe how librarians used the tools. Thus, each librarian had the flexibility to use the tools in the way they considered most helpful. The specific tools and methods each librarian used were captured in the tracking sheet (see Appendix A).

Complex Versus Simple Review Topics
Identifying keyword/phrase and subject terms for a narrowly defined clinical topic (i.e., a single drug for a single indication) is a relatively straightforward process; see, for example, "Pharmacotherapy for the Treatment of Cannabis Use Disorder." 22 However, the difficulty of the task is magnified for complex topics, such as multiple drugs for multiple indications, topics requiring a complicated logic search structure, or diffuse multicomponent interventions (i.e., health services topics); see, for example, "Maternal and Fetal Effects of Mental Health Treatments in Pregnant and Breastfeeding Women: A Systematic Review of Pharmacological Interventions." 23 The recognized impact of complexity on the process of conducting systematic reviews 24 warranted an exploration of variability in TMT performance for different topic types. The research team prospectively discussed each review as it was identified and decided by consensus whether it was simple or complex.

Text Mining Tools Librarian Assignment
TMT librarians were assigned to searches by the SRC to conceal their identities from review teams and other librarians. Not all EPC librarians have access to the Ovid platform, and some prefer to use PubMed.gov for conducting searches. For the purposes of this study, the content of these two versions of the database was considered equivalent. Thus, when MEDLINE appears in the report text, it indicates the librarian searched either PubMed.gov or Ovid MEDLINE Epub Ahead of Print, In-Process & Other Non-Indexed Citations or Ovid MEDLINE ALL. All PubMed platform searchers conducted their searches in the Legacy PubMed interface.
To avoid confounding due to the differences in these platforms (in terms of search construction and functionality), if the UP librarian searched PubMed, then we chose a TMT librarian who also searched PubMed; similarly, if the UP librarian searched Ovid MEDLINE, then we chose a TMT librarian who also searched Ovid MEDLINE. Some of our chosen TMTs are designed to receive and export information using PubMed syntax only. Because the Ovid platform uses different syntax, Ovid users developed methods to work around this either by inputting a list of PMIDs or recreating a simple PubMed MeSH search in those tools.

Outcome Assessment and Data Analysis
After each librarian developed and conducted her search, the deduplicated retrieved citations were sent to the SRC. The SRC then coded both sets of retrieved citations with unique identifiers to indicate whether a record was unique to the UP search, unique to the TMT search, or retrieved by both. The SRC then sent the review team the combined results of the UP and TMT searches to incorporate into their screening process. After the draft review was completed, the review team sent a list of citations included in the draft report, from any search (UP, TMT, or other sources) for which there was a PMID. Thus, the final report citations may vary from those used in this analysis. The records from this list that had PMIDs were the reference standard included citations and defined how many records from each search were included in the final synthesis. Sensitivity and NNR were calculated against the reference standard for each review for each approach. The complete data tables are in Appendix B.
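The record-coding step described above amounts to simple set operations on the two retrieval sets. A minimal sketch, assuming each search result is reduced to a set of PubMed IDs (the PMIDs below are hypothetical placeholders, not data from the study):

```python
# Hypothetical deduplicated retrieval sets, keyed by PMID.
up_pmids = {101, 102, 103, 104, 105}   # retrieved by the UP search
tmt_pmids = {103, 104, 105, 106}       # retrieved by the TMT search

# Code each record as unique to one search or retrieved by both,
# mirroring the SRC's coding of combined results before screening.
unique_up = up_pmids - tmt_pmids    # only the UP search found these
unique_tmt = tmt_pmids - up_pmids   # only the TMT search found these
both = up_pmids & tmt_pmids         # overlap retrieved by both searches

print(sorted(unique_up), sorted(unique_tmt), sorted(both))
```

The three coded groups are then merged and sent to the review team for screening, so neither librarian's results are screened in isolation.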
For each study, both librarians completed a prospectively designed tracking sheet (see Appendix A), indicating the number of hours spent, the MEDLINE platform searched (PubMed or Ovid), and the total number of citations found after deduplication. Time spent was recorded in total and by specific task: MeSH term generation, keyword phrase generation, and strategy evaluation.

Quantitative Assessment
To operationalize KQ1, we summed the number of hours each librarian spent on each aspect of the search and overall.
To operationalize KQ2a, we calculated sensitivity as identified included / (identified included + not identified included). For KQ2b, we calculated precision as identified included / total citations retrieved; and NNR as 1 / precision. We operationalized these variables as described in Tables 1 (UP searches) and 2 (TMT searches). Because of the small sample size, the analysis was limited to descriptive measures, including means, standard deviations, and 95% confidence intervals, and formal statistical testing was not performed.
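The outcome calculations defined above can be sketched in a few lines of code; the counts in the example are hypothetical, not values from the study:

```python
def search_metrics(identified_included, not_identified_included, total_retrieved):
    """Return (sensitivity, precision, NNR) for one search strategy.

    sensitivity = identified included / (identified included + not identified included)
    precision   = identified included / total citations retrieved
    NNR         = 1 / precision (citations screened per relevant citation found)
    """
    sensitivity = identified_included / (identified_included + not_identified_included)
    precision = identified_included / total_retrieved
    nnr = 1 / precision
    return sensitivity, precision, nnr

# Example: a search that found 46 of 50 reference-standard included
# studies among 4,000 retrieved records.
sens, prec, nnr = search_metrics(46, 4, 4000)
print(f"sensitivity={sens:.0%}, NNR={round(nnr)}")  # sensitivity=92%, NNR=87
```

Note that a higher NNR means a heavier screening burden, so an intervention that improves precision lowers NNR.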

Qualitative Assessment
In the tracking sheet, TMT librarians identified the tool(s) used and answered additional qualitative questions about their process. We elicited TMT librarian comments for some previously unaddressed issues. In order to understand how TMT librarians were using the tools, we asked for a brief description of the methods used in creating the text-mining search. To evaluate whether the seed set of citations used in the TMTs was sufficient and unbiased, we asked for the number of known citations used in the seed set (using predefined response categories) and the TMT librarian's estimation of the representativeness of the seed set. The latter response categories included: "overly comprehensive," meaning that it turned up a large number of clearly irrelevant terms; "perfectly balanced," meaning that the terms retrieved appeared sufficient to cover the entire topic but did not include a large number of irrelevant terms; or "subset of vocabulary terms," meaning that the terms returned did not sufficiently cover the topic and the librarian had to add relevant terms identified using different methods. To better understand how the tools could be used, we asked when the software offers multiple types of

Results
We recruited three organizations to participate in the study: the AHRQ EPC Program, the VA ESP, and the Cochrane Collaboration. We approached 12 review teams, and nine agreed to participate in the study. Two systematic reviews were not included in the final quantitative analyses: one due to protocol violations and the other because our study period ended before the final included citations list was available. The seven evidence syntheses included five de novo systematic reviews, one systematic review update, and one evidence map. These reviews were classified into simple (n=4) or complex (n=3) topics. Table 3 lists the review titles and classification, as well as the TMTs used. All TMT searches used PubReMiner as the keyword/phrase tool, which may reflect the comparative ease of using PubReMiner. No other tool was consistently used or avoided.
In the qualitative section, in addition to the seven reviews in the quantitative analysis, we also included comments from the eighth review whose quantitative results we did not receive before data collection ended.

Table 3 (fragment). Complex: PubReMiner, Yale MeSH Analyzer, Carrot2.
*Only available on the VA Intranet. **Only qualitative results have been analyzed due to nonavailability of quantitative results at close of data collection.

Participating Librarians
All six participating librarians (UP and TMT) have a master's degree in library science and worked within the EPC Program at the beginning of the study. There was substantial overlap, with several librarians serving as both a UP and a TMT librarian on different projects. The librarians who served only as UP or only as TMT librarians did not differ in any substantial way from each other or from the librarians who performed both UP and TMT searches. Table 4 presents summary descriptive data on participating librarians. No TMT librarian was particularly familiar with any topic in a way that could bias the results. Peer review of the search strategies elicited suggestions for keywords or MeSH terms in one UP search and one TMT search.

Quantitative Results
The quantitative results are based on seven reviews, four of which were classified as simple and three as complex. We present the across-study summary data in Table 5. Please see Appendix B for the data tables upon which Tables 5 and 6 are based. Abbreviations: NNR = number-needed-to-read; SD = standard deviation; TMT = text-mining tools; UP = usual practice.
Note: NNR is rounded to the nearest whole number for ease of interpretation.

Subgroup Analysis
We included both de novo reviews and a systematic review update in our study. Update searches often require significant reworking of the search strategy, and the original review is used as a source of seed citations. However, in the case of R5 (an update search) the UP librarian reused the original search, leading to discrepant results. We therefore did a subgroup analysis without R5, recalculating the sensitivity, NNR, and time spent to evaluate its effect on the all-reviews and complex-review results. We present the results in Table 6. Overall, removing that review increased mean sensitivity and NNR for both the UP and TMT processes. It also increased the average time required for the UP search but did not affect the average time required for the TMT search.

Key Question 1: Time Developing Search Strategy (Hours)
The average number of hours to create the search was 12 (SD 8) for UP librarians, compared to 5 (SD 2) for TMT librarians. Figure 3 shows the total time spent by each librarian on each review, across all three tasks (keyword/phrase, MeSH terms, and strategy evaluation), classified by simple and complex review types. In all but one review (R5), the UP search took more time than the TMT search. The R5 UP librarian reported a very short time because the project was a systematic review update and the librarian ran the existing strategy, while the TMT librarian edited the search with new terms. Figure 4 breaks down time spent across simple and complex reviews by task for each librarian. The time savings did not come from any single task: except for finding MeSH terms for simple reviews, TMT librarians spent less time on each task than UP librarians. The "other" category accounts for time spent on other search activities, such as attending review team meetings for usual practice librarians or learning how to use a text-mining tool for text-mining librarians. Note: Time is rounded to the nearest whole number because the data were imprecise.

Key Question 2a. Yield of Relevant Citations (Sensitivity)
Across all reviews, we found that the UP searches appeared to be more sensitive, with an average sensitivity of 92 percent, compared with 84.9 percent for the TMT searches (calculated as the mean of the per-review sensitivities, rather than a pooled sensitivity across projects). See Figure 5 for results by review, classified by simple and complex review types. Overall, between 5 and 19 unique relevant citations were identified using the UP approach, and between 1 and 4 additional citations were identified using the TMT approach (Figure 8).

Key Question 2b. Burden of Excess Citations (NNR)
The overall mean NNR results were 83 (SD 34) for the UP librarian and 90 (SD 68) for the TMT librarian, a mean difference of seven more for the TMT librarian (see Figure 6, results grouped by simple and complex review types). NNR was calculated by combining averages, rather than by the average NNR across individual projects. Note: NNR is rounded to the nearest whole number for ease of interpretation. R4: only qualitative results have been analyzed due to nonavailability of quantitative results at close of data collection.

Key Question 3. Identification of Irrelevant Records During Search Strategy Evaluation
This objective was not addressed in the quantitative results but is discussed below in the qualitative results section.

Time Developing Search Strategy (Hours)
As shown in Figures 3 and 4, TMT librarians saved more time on complex reviews (average 8 hours saved) than on simple reviews (average 5 hours saved). In complex reviews, usual practice times varied widely (1 hour in R5, the update review; 10 hours in R7; and 32 hours in R2), while TMT times were more consistent (6 hours in R2 and R5 and 8 hours in R7). In simple reviews, times were more consistent across reviews, ranging from 10 to 12 hours for the UP librarians and 2 to 8 hours for the TMT librarians. The time savings for simple reviews was smaller than we had expected, possibly because simple reviews still required identification of multiple terms for each concept (e.g., a wide variety of aromatherapy interventions) despite having a relatively simple logic structure.

Yield of Relevant Citations (Sensitivity)
In simple reviews, on average, one study identified by the UP search was not identified by the TMT search; the TMT search found no articles not identified by the UP search (a mean sensitivity for UP of 96% compared to 92% for TMT). In complex reviews, the mean sensitivity was lower for both strategies, and the difference in mean sensitivity was greater between the strategies (87% for UP and 75% for TMT) (see Figure 5 for results displayed graphically). Figure 7 shows that across all reviews, UP searches identified a greater percentage of the relevant (reference standard final included) citations than TMT searches, with the exception of R8 in which both UP and TMT searches identified all included citations. Dark blue bars indicate relevant studies found by both methods, while light blue bars indicate relevant citations found by one method and not the other. In simple reviews, UP searches identified unique included citations in 3 out of 4 reviews (light blue bars) while TMT searches did not retrieve any unique included citations. In complex reviews, UP searches identified more unique included citations than TMT searches across all 3 reviews, but TMT searches did identify at least one unique included citation in each review. The dotted bars indicate the relevant citations not found by a search.

Burden of Excess Citations (NNR)
For simple reviews, the TMT search yielded an average of 22 more articles per relevant article than the UP search (NNR), but for complex reviews the TMT search yielded an average of 14 fewer articles for evidence synthesis team screeners to review per relevant article than the UP search (see Figure 6 for results displayed graphically). Figure 8 shows that in 4 out of 7 reviews, UP searches retrieved a greater percentage of the total citations (both relevant and not relevant) than TMT searches (dark and light blue bars combined), leading to a greater screening burden. In simple reviews, UP and TMT searches each retrieved a greater percentage of the citations in 2 out of 4 reviews. In complex reviews, UP searches retrieved a greater percentage of the citations in 2 out of 3 reviews. Overlap in citations (dark blue bars) found by both searches ranged from a low of 11.2 percent (R5) to just over 62 percent (R3). This overlap of irrelevant citations retrieved by the different search methods is smaller than we would have expected. The combined mean NNR of the UP and TMT searches is 123, indicating that a combined UP and TMT search may lead to a greater screening burden. Please see Appendix B for complete data (Figure 7 is column 2 and Figure 8 is column 3). Note: R4: only qualitative results have been analyzed due to non-availability of quantitative results at close of data collection.

Qualitative Results
All the TMT librarians had limited experience using TMTs when the study began. Due to the brevity and small number of comments, a full content analysis was not performed; however, we present some topics with selected, edited responses below. Full qualitative comments are available in Appendix C.

Identification of Irrelevant Records During Search Strategy Evaluation Step
The majority of qualitative comments suggest that TMTs most often did not identify irrelevant concepts for removal from the search strategy. However, one TMT librarian reported that they were helpful:

Evaluating Seed Set for Bias
As described in the methods section, we developed two questions to evaluate whether the seed set of citations used in the TMTs was sufficient and unbiased (Table 7): (1) the number of known citations used in the seed set and (2) the TMT librarian's estimation of the representativeness of the seed set.
The seed sets for simple topics were evenly split between 11-30 citations (n=2) and 31-60 citations (n=2), whereas the seed sets for complex topics were mixed, including 31-60 (n=1) and 101+ (n=2). No review topics used fewer than 10 or 61-100 known citations, but this is probably because our sample size was small. In Table 7, we report the number of seed set citations, along with TMT librarian comments, by review number. Grouping the results by number of known citations suggests that when there are more known seed set citations, the vocabulary terms derived from a TMT's analyses are likely more representative of the review topic. For reviews with fewer than 60 known citations, the variety of responses suggests wide variability in the usefulness of the citations as a keyword or MeSH term development set.

Table 7. Number of known citations used in text-mining seed set by review, with comments

Review / No. seed set citations* / Representativeness**

11-30
Overly comprehensive (lots of junk terms). Comment: It was difficult to determine how representative the known citations were of the topic area.

11-30
The citations may have been a little broad, but generally seemed good.

11-30
Subset of vocabulary terms (had to supplement elsewhere). Comment: I had to add a number of terms that were not identified through TMT, probably because there were so few seed citations.

31-60
Perfectly balanced (had all needed terms without a lot of junk).

31-60
This was a very clean search.

R7 (Complex)
31-60
Subset of vocabulary terms (had to supplement elsewhere). Comment: The text-mining tools did help to identify some relevant keywords & MeSH headings; however, there were many irrelevant results to wade through to find a small number of relevant terms. Seeing the terms out of context in PubReMiner for unfamiliar topic areas was less than helpful and required additional follow-up to determine if keywords were relevant or not. Would have preferred to use a tool where keywords could be viewed in groups (bigrams, trigrams).

R2 (Complex)
101+
Perfectly balanced (had all needed terms without a lot of junk). Comment: There was a lot of junk, but this is a complex topic, so I think fewer citations would have led to gaps in the search.

101+
Perfectly balanced (had all needed terms without a lot of junk). Subset of vocabulary terms (had to supplement elsewhere). Comment: I would say it was somewhere between these two actually…this was a complex search and it was also an update search, so I had the list of included citations and the existing search to begin. I was also aware of the possibility that exclusively using the existing include list might bias the results (e.g., to vocabulary being used at the time of the previous review (intervening semantic drift) or if the original review search was not inclusive enough), so once a fairly robust search was established, I then re-ran the TMT searches to determine if there were other additional text and MeSH terms to consider. I also experimented with creating an initial search with all the known terms, finding systematic reviews, meta-analyses, trials with those words in the title, taking a sample to plug into PubReMiner and VOSviewer. I thought this approach might work for new review topic searching as well.

*Responses to the question "What number of known citations did you use in the seed set? (up to 10, 11-30, 31-60, 61-100, 101+)"
**Responses to the question "Did the known citations used for text-mining analysis represent the diversity of vocabulary terms or a subset of terms used in this area of research? (overly comprehensive [lots of junk terms], perfectly balanced [had all needed terms without a lot of junk], subset of vocabulary terms [had to supplement elsewhere])"
Abbreviations: R = review; TM = text mining; TMTs = text-mining tools

Developing Methods for Using Text-Mining Tools in Systematic Review Searches
We collected initial experiences of using freely available TMTs to expand our field's real world understanding of how to approach this new class of search tools. Below are three selected and edited quotes on search techniques, followed by comments on specific tools by TMT librarians. See Appendix C for the complete comments.
"I used the list of known relevant systematic reviews from the provided Excel spreadsheet to create a list of PMIDs to enter into PubReMiner & the Yale MeSH Analyzer. The results were reviewed to identify relevant keywords & MeSH headings. Several of these keywords & MeSH headings were then entered into PubReMiner & MeSH on Demand to identify additional relevant terms. In addition, I also input various portions of text from the Systematic Review Protocol document into MeSH on Demand to identify other relevant headings."

"The text-mining tools were a great complement to usual practice, and going forward I plan to utilize them more often during the strategy development period. However, I would not feel comfortable designing a strategy solely using text mining, as there are many irrelevant results returned and the lack of context for unfamiliar topic areas requires additional follow-up. While working on this project I developed a routine of flagging potentially relevant keywords & headings, which then required me to do additional research to see if they were in fact useful for the strategy."

"I generated my seed set by: (1) using references in the protocol, (2) running a quick PubMed query and looking at related references, (3) identifying review articles on the topic and then adding their included citations. I'm unsure if there are other more effective methods to identify test articles, or if my approach was appropriate?"

Comments on Text-Mining Tools Overall and by Specific Tool
We were interested in gleaning real-life experiences using the study's TMTs (i.e., what works, what were particularly easy/difficult tasks, etc.). Most TMTs are developed to use default PubMed record output, so users of other platforms must create workarounds. One TMT librarian commented that, in general, the tools increased the complexity of the process: "Time was added to the search process to tweak and troubleshoot issues related to constraints of the TMTs (character limits, output limits, search input formatting issues, etc.). For example, while using some of the tools, searches had to be tweaked several times because the output was too large for the tool to handle. Related to this issue, I had to very narrowly limit the search date range while performing keyword searches related to CT [computed tomography] imaging. Could this very narrow search date window negatively affect process/results?"

AntConc (Keyword/Phrase Tool | Downloadable Software)
AntConc is a linguistic tool useful for identifying high-frequency/occurrence keyword terms and bound phrases. One TMT librarian described their workflow along with several difficulties: "I used [this tool] on a different computer than I had previously (MacBook) and had to override some security settings to get the application to run… (1) Generated a text file from the titles and abstracts and imported it into AntConc. (2) Analyzed the "Word List" tab, and then (3) selected key terms to see their context via the "Concordance" tab. I found it difficult to determine a cut-off threshold for occurrence (selected 4 and did not look below this number). I viewed some "Clusters/N-Grams" for key terms (e.g., abortion). It was difficult to determine when a phrase search for a term should be used instead of a single term (i.e., I knew the phrase "medical abortion" occurred 27 times in my corpus; should I use this phrase in my search or just the term "abortion"?)."
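The word-list-plus-cutoff workflow described in this quote can be sketched in a few lines. The snippet below is only an illustration of the technique, not part of the study's toolchain: the sample abstracts are invented, and the occurrence cutoff of 4 mirrors the librarian's choice.

```python
from collections import Counter
import re

# Hypothetical corpus standing in for exported titles/abstracts.
abstracts = [
    "Medical abortion provided by pharmacists in community pharmacy settings.",
    "Pharmacist-led provision of medical abortion: a feasibility study.",
    "Community pharmacy access to mifepristone for medical abortion.",
    "Medical abortion outcomes when pharmacists dispense mifepristone.",
]

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

# "Word List": single-term frequencies across the corpus.
words = Counter(t for a in abstracts for t in tokens(a))

# "Clusters/N-Grams": two-word phrase frequencies.
bigrams = Counter(
    f"{ts[i]} {ts[i + 1]}"
    for a in abstracts
    for ts in [tokens(a)]
    for i in range(len(ts) - 1)
)

CUTOFF = 4  # occurrence threshold; terms below it are not reviewed
candidates = {w: n for w, n in words.items() if n >= CUTOFF}
print(candidates)                    # → {'medical': 4, 'abortion': 4}
print(bigrams["medical abortion"])   # → 4
```

The cutoff trades recall for review effort: a lower threshold surfaces more candidate terms but more noise, which is exactly the judgment call the librarian describes.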

MeSH on Demand (Subject Term Tool | Web-Based Tool)
MeSH on Demand is a National Library of Medicine (NLM) tool designed to analyze end-user input text and suggest NLM medical subject headings (MeSH). One librarian commented: "I used a section of text from the protocol to identify subject headings. Note: limited to 10,000 characters, so I had to select a section of the protocol, as the full protocol was over 20,000 characters. Very quick to analyze text (took seconds to get my subject headings). No additional information about term explosions, so I still had to look up each MeSH term. Articles identified as being relevant were not about pharmacists, so I did not add any to my seed set of articles."

PubReMiner (Keyword/Phrase and Subject Term Tool | Web-Based Tool)
PubReMiner analyzes end-user-defined PubMed records and generates frequency tables of bibliographic record fields (title, abstract, MeSH terms, journal, etc.). Two librarians commented on using PubReMiner: "Still had to generate a PubMed query. I tried to build a query using my seed set of articles with their PMIDs, but this didn't work (or I couldn't get this to work), so I generated a quick query string: (abortion or mifepristone or misoprostol) AND (pharmacist or pharmacists OR pharmacy OR pharmacies OR chemists). This query resulted in 639 references. Identified additional MeSH terms not found with PubReMiner. Difficult to determine a cut-off threshold for occurrence (selected 10 and did not look below this number)." "I uploaded the list of PMIDs and ran the search. I then selected the following fields from the right-hand side of the screen to manually adjust the search: MeSH, Substance, and WORD TI_AB. These seemed the most useful ways to focus the search for developing terms. In looking over the results, term occurrence was the most important factor in selecting potential terms to test in the strategy."

Carrot2 (Visualization Tool | Downloadable Tool)
Carrot2 is a thematic clustering tool for small collections of documents. One librarian opined: "…I still find Carrot2 more useful for this [strategy evaluation] stage as it clumps the references into topics, but I did like looking at the way terms were connected in VOS."

VOSviewer (Visualization Tool | Downloadable Tool)
VOSviewer visualizes connections in bibliographic networks. One librarian noted: "For VOSviewer, it took several searches to figure out that the maps I prefer to use are constructed by downloading two Research Information Systems (RIS) files (one including TI, AB, author keywords fields; and the other including MeSH terms and maybe registry name fields). In VOSviewer under file, map, create, create map based on text data, RIS tab (keep ignore fields checked), I could upload each file in turn. Once the map was created, I was looking for non-relevant keywords or MeSH terms that appear as larger bubbles, since these represent the number of occurrences of the term. Sometimes (not always) this has been ideal for finding candidate terms to NOT out to make the search more specific. Of course, one has to test these before removing them from the search..." "I tried other mapping displays but found the map based on text data the easiest to 'read'. I also found that I had to split the results into two files, one for keywords in the title/abstract and the other for MeSH terms. When I didn't do that, the MeSH terms dominated the display to the detriment of everything else."

Summary of Findings
In this project comparing keyword and subject searches developed with and without TMTs, we found that the tools decreased the time needed to develop keyword and subject term strategies compared with conventional approaches. Experienced librarians took less time to develop the TMT searches than the UP searches, and this time savings carried across all tasks. The savings may have come from TMT librarians' ability to use word frequency tables rather than having to extract the information from texts, or from the fact that TMT conceptual groupings might identify "red herrings" quickly and allow for their elimination. It may also stem from UP librarians' involvement in the search process for a longer period of time (i.e., attending team meetings and conducting topic and scope refinement searches).
It is uncertain whether TMTs can improve the evaluation step. Only a single librarian reported that she found concepts that could be eliminated using TMTs, while other librarians reported using these tools but not eliminating any terms based on the results. This suggests that using TMTs in the evaluation step may be worthy of additional study.
In terms of whether TMTs improve search recall (sensitivity), we found that, overall, the UP searches were slightly more sensitive than the TMT searches across all projects. Neither UP nor TMT searches were perfectly sensitive across the sample reviews, reinforcing that other supplementary search techniques, including handsearching, remain important for comprehensive systematic review searches.
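For reference, the two performance measures discussed throughout (sensitivity and NNR) are simple ratios over screening counts. A minimal sketch with hypothetical counts (not the study's data):

```python
# Hypothetical screening counts for one review, not the study's actual data.
total_relevant = 50        # eligible citations known for the review
retrieved = 4150           # citations returned by the search
relevant_retrieved = 46    # eligible citations the search captured

# Sensitivity (recall): share of known eligible citations the search found.
sensitivity = relevant_retrieved / total_relevant

# Number-needed-to-read: citations screened per eligible citation found.
nnr = retrieved / relevant_retrieved

print(f"sensitivity = {sensitivity:.0%}, NNR = {nnr:.0f}")
```

A more sensitive search (higher recall) typically retrieves more citations overall, which raises NNR and hence screening burden; the trade-off between the two is the central tension in the results above.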
In regard to whether the type of review topic (complex versus simple) makes a difference in the performance of TMTs, we found that there may be more of a role for TMTs in complex reviews. For simple review topics (i.e., single indication-single drug) TMT searches resulted in no unique relevant citations (and missed one relevant study in three of four reviews), but reduced time spent in search design. For complex review topics (e.g., multicomponent interventions) TMT searches identified some unique includable citations and reduced time spent in search strategy development but missed between four and nine relevant citations identified by the UP search.
Finally, TMT librarians' evaluations of the tools indicate they used a variety of combinations of tools and techniques to complete the searches. The most effective and efficient text-mining methods/processes (aka "best practices") for searchers are still in a formative stage, and we do not feel that the evidence has reached a point where specific guidance on the use of freely available TMTs is meaningful. The research base on these tools is very small, with very few studies on any specific tool. Additionally, the TMTs are evolving, so guidance issued now may not remain relevant. Best practices in the use of TMTs are an area deserving future research.

Strengths and Limitations
This project is, to our knowledge, the first to use a variety of off-the-shelf, freely available tools to evaluate the contribution of TMTs to search strategy design. It drew on professional librarians with a great deal of experience, and peer review ensured that all searches (both UP and TMT) were of high quality.
Nevertheless, this study has limitations. For one, the small sample of reviews makes it hard to draw conclusions across projects, and especially to draw conclusions about TMTs' performance in specific types of projects. Future research should evaluate these tools across a larger sample and variety of reviews, including simple and complex reviews, types of reviews (e.g., rapid reviews, scoping reviews, etc.), and reviews that span different disciplines. Additionally, not every project team we approached agreed to participate in the study. It is possible that the search processes or review types of those who participated differ in a meaningful way from those who did not. Finally, the sample size of librarians was small and all had years of experience, so we were not able to test whether TMTs level the playing field between inexperienced and experienced librarians.
Another limitation comes from our intent to reflect real-world use of freely available tools, as well as our reluctance to pre-emptively establish "best practices" in TMT use. Thus, we gave very little guidance on how specific TMTs should be used or even which tools should be used from our list. This led to a variety of approaches, which affected the quantitative results. For example, TMT librarians combined usual practice with text-mining to varying degrees, making it hard to isolate the actual effect of text-mining alone. In addition, the peer review process, while ensuring the searches were of comparable quality, may have led to the addition of terms that were not derived from text-mining alone.
Another limitation has to do with variations in how librarians operationalized time recording and search development. This led to some outliers in the results. It also did not allow us to calculate how much time the UP and TMT librarians spent identifying seed citations. In addition, one study was an update (R5), while the others were de novo reviews, which may limit the utility of the time estimate for that search in the overall time analysis. For this reason, we did a subgroup analysis without R5 to determine its effect on the sensitivity, NNR, and time spent across all reviews and complex reviews.
Finally, the librarians were relatively inexperienced with the tools at the beginning of the study and overall found that the tools took a lot of time to learn and were at times less functional than hoped. The librarians used a variety of tools, so we cannot comment on whether any particular tool is better than another. Librarians with more experience and expertise using specific TMTs might achieve better results.

Implications for Practice and Relevance to Existing Literature
This project expands on Hausner et al.'s previous work. [5][6][7] However, the Wordstat program used in their research is subscription based and is therefore not universally available to librarians in the EPC program. Instead, we used freely available web-based TMTs, which increases the applicability of our findings. Our study looked at the utility of incorporating existing text-mining software (AntConc, PubReMiner, MeSH on Demand, Yale MeSH Analyzer, Carrot2, and VOSviewer) into the process of search strategy design. Overall, we found that these methods were slightly less sensitive, but led to a reduction in time spent developing the search and may reduce the burden on the team in the number of citations that have to be screened.
We were specifically interested to see whether these tools would be helpful in the context of reviews undertaken by the EPC program, which tend to be complex and multicomponent, as compared to simpler one intervention-one indication reviews. We found that incorporating TMTs for complex topics may allow the searcher to find terms that identify citations not found by what Hausner et al. refer to as the conceptual search. 6 Our results are similar to those of Hausner et al., who found that in a sample of Cochrane reviews, an objective search using a text-mining program in Wordstat had similar sensitivity to the original searches. 5 However, unlike Hausner et al., we found that the UP searches appeared to be more sensitive than the TMT searches (but not statistically significantly so), although both UP and TMT searches missed relevant citations. This may reflect differences in the way that we applied text-mining (via the integration of freely available tools with little guidance, as opposed to via an algorithm and statistical package), or it may have to do with the high rigor of the UP searches in this project. Future research should focus on comparing text-mining tool functionality and usability, as well as establishing guidelines and best practices for librarians using freely available TMTs.
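The split-and-validate structure of Hausner et al.'s objective approach (a development set drives term selection; a held-out comparator set checks the finished strategy) can be sketched as follows. The PMIDs and retrieval counts here are hypothetical, and the sketch only illustrates the bookkeeping, not any particular tool:

```python
import random

# Hypothetical PMIDs standing in for a review's known relevant citations.
known_pmids = [f"PMID{i:04d}" for i in range(1, 41)]

# Objective-approach split: development set informs term selection;
# comparator set is held out to validate the finished strategy.
rng = random.Random(42)
shuffled = known_pmids[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
development, comparator = shuffled[:half], shuffled[half:]

def validate(search_results, comparator):
    """Fraction of held-out comparator citations the search retrieved."""
    hits = set(search_results) & set(comparator)
    return len(hits) / len(comparator)

# e.g., a strategy that retrieved 18 of the 20 comparator citations:
retrieved = set(comparator[:18]) | set(development)
print(f"comparator sensitivity = {validate(retrieved, comparator):.0%}")
```

Because the comparator citations played no role in term selection, the validation figure is an honest estimate of sensitivity, which is the property that distinguishes the objective approach from simply re-checking the development set.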
One issue that has not been addressed in the literature to date is how to evaluate known relevant citations used in the seed set for their representativeness of the literature, in terms of vocabulary used, known interventions, MeSH terms assigned, and so on. This type of evaluation is important to prevent what Eustace dubbed "technology-induced bias" (i.e., if the known citation seed set is biased in one or more ways, it is very likely the TMTs' output will also be biased). 29 In addition, the ideal size of seed sets for systematic review searches is presently unknown, and methods have not been developed to aid in evaluating them for bias. Nevertheless, our qualitative analysis suggests that when there are more known citations as a percentage of the literature base in the seed set, the results of the TMTs are more representative of the breadth of the review topic.
Both the Cochrane Handbook 9 and the AHRQ Methods Guide 8 recommend that systematic review searches be comprehensive, striving to identify all relevant citations. Based on the findings of our study, text-mining technology is not ready to be used as the sole process for developing systematic review searches, but the time savings in search design and relatively high sensitivity for complex reviews suggest that this technology may be useful in reviews that do not require maximum sensitivity, such as rapid or scoping reviews. In addition, TMTs are useful in combination with usual practice to find citations missed by the usual search process. Nevertheless, adding a TMT step to the UP search strategy development process will increase the screening burden (NNR) and time required for search development.

Conclusions
Overall, this study found that incorporating TMTs into search strategy development for systematic review projects may reduce time identifying keyword and subject terms but at the cost of potentially decreased sensitivity. For simple topics, TMT searches seemed to have similar (or slightly lower) sensitivity, higher NNR, and required less time than UP searches, but in all cases with overlapping confidence intervals. For complex topics, TMT searches seemed to have lower sensitivity, lower NNR, and required less time than UP searches. Research is needed to improve the utility of off-the-shelf TMTs for use by systematic review search librarians and examine the various ways librarians are using these tools. In addition, research is needed on how to evaluate the corpus of records used by the tools for representativeness (the seed set).

R8
There were not all that many citations to go on for this review, but I found the TMT process particularly straightforward on this topic. I used PubReMiner and Carrot2, as I am most familiar with them. I also looked at the AntConc output but didn't do anything based on it.

Review Number | Answer

R4
Subject Headings
MeSH on Demand - I used a section of text from the protocol to identify subject headings. Note: limited to 10,000 characters, so I had to select a section of the protocol as the full protocol was over 20,000 characters. Very quick to analyze text (took seconds to get my subject headings). No additional information about term explosions, so I still had to look up each MeSH term. Articles identified as being relevant were not about pharmacists, so I did not add any to my seed set of articles.
PubReMiner - Still had to generate a PubMed query. I tried to build a query using my seed set of articles with their PMIDs, but this didn't work (or I couldn't get this to work). So, I generated a quick query string: (abortion or mifepristone or misoprostol) AND (pharmacist or pharmacists OR pharmacy OR pharmacies OR chemists). Query resulted in 639 references. Identified additional MeSH terms not found with PubReMiner. Difficult to determine a cut-off threshold for occurrence (selected 10 and did not look below this number).
Key Words
AntConc - Used on a different computer than I had previously (MacBook) and had to override some security settings to get the application to run. Exported seed set of 17 articles (below) from PubMed using query: Generated a text file from the titles and abstracts and imported it into AntConc. Analyzed the "Word List" tab and then selected key terms to see their context via the "Concordance" tab. I found it difficult (as per note on PubReMiner above) to determine a cut-off threshold for occurrence (selected 4 and did not look below this number). Viewed some "Clusters/N-Grams" for key terms (e.g., abortion). Difficult to determine when a phrase search for a term should be used instead of a single term (i.e., I knew the phrase "medical abortion" occurred 27 times in my corpus; should I use this phrase in my search or just the term "abortion"?).
No response.

Review Number | Answer

R5
I used Yale MeSH Analyzer and PubReMiner for subject term and keyword text-mining, and VOSviewer for strategy evaluation. I ended up preferring PubReMiner because it presented the results for both keywords (TI, AB) and subjects from my Ovid MEDLINE search in ranked order by the number of times terms were used, which is much easier, faster, and more informative than going through the record layout available in MeSH Analyzer. Once I realized that I could export the PMID list from Ovid in spreadsheet format, and then copy and paste the PMID column from the spreadsheet into PubReMiner, it became even easier and faster. For VOSviewer, it took several searches to figure out that the maps I prefer to use are constructed by downloading two RIS files (one including TI, AB, author keywords fields; and the other including MeSH terms and maybe registry name fields); then in VOSviewer under file, map, create, create map based on text data, RIS tab (keep ignore fields checked), and then upload each file in turn. Once the map is created, I am looking for non-relevant keywords or MeSH terms that appear as larger bubbles, since these represent the number of occurrences of the term. Sometimes (not always) this has been ideal for finding candidate terms to NOT out to make the search more specific. Of course, one has to test these first before removing them from the search...

R7
For the text-mining searches, I utilized several approaches. I used the list of known relevant systematic reviews from the provided

R1
Perfectly balanced (had all needed terms without a lot of junk)

R3
This was a very clean search.

R6
The citations may have been a little broad, but generally seemed good.

R8
Subset of vocabulary terms (had to supplement elsewhere) Comment: I had to add a number of terms that were not identified through TM, probably because there were so few seed citations.

R4
Overly comprehensive (lots of junk terms). It was difficult to determine how representative the known citations were of the topic area.

R2
Perfectly balanced (had all needed terms without a lot of junk). Comment: There was a lot of junk, but this is a complex topic, so I think fewer citations would have led to gaps in the search.