We developed a hybrid human expert/artificial intelligence system to keep systematic reviews up to date.
The system continuously surveils PubMed/MEDLINE for new relevant articles, and notifies review authors.
A living abstract is made available, which shows the status of the review in real-time.
In a pilot, the system was effective and reduced workload in a systematic review of COVID-19 vaccination studies.
Objectives
The aim of this study is to describe and pilot a novel method for continuously identifying newly published trials relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise.
Study Design and Setting
We used RobotReviewer LIVE to keep a review of COVID-19 vaccination trials updated from February to August 2021. We compared the papers identified by the system with those found by the conventional manual process by the review team.
Results
The manual update searches (last search date July 2021) retrieved 135 abstracts, of which 31 were included after screening (23% precision, 100% recall). By the same date, the automated system retrieved 56 abstracts, of which 31 were included after manual screening (55% precision, 100% recall). Key limitations of the system include that it is limited to searches of PubMed/MEDLINE, and considers only randomized controlled trial reports. We aim to address these limitations in future. The system is available as open-source software for further piloting and evaluation.
Conclusion
Our system identified all relevant studies, reduced manual screening work, and enabled rolling updates on publication of new primary research.
We developed a system which continuously identifies newly published trial evidence relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise.
In a pilot update of a systematic review of COVID-19 vaccination trials, the semi-automated system found 100% of the relevant studies identified by a conventional manual update.
What this adds to what was known?
Living systematic reviews have been proposed as a new model for keeping evidence syntheses updated. Most current living reviews rely on repeated manual update searches, which are time consuming and laborious.
We show that using a hybrid AI/expert model could lead to lower latency updates, potentially reducing workload, and improving the currency of systematic reviews.
What is the implication and what should change now?
Systems which use AI to automatically notify systematic review authors of new evidence (“push” updates) are feasible, and should be piloted on a wider range of systematic reviews.
Future research should examine how best to adapt these technologies for use in more complex reviews (particularly reviews of nontrial evidence, and those with complex inclusion criteria).
Journal publishers should investigate models for rapid updating, to enable automated live updates of review status to be published.
For many health conditions and treatments, evidence accumulates rapidly. Systematic reviews identify, appraise, and synthesize all empirical evidence on healthcare topics, and are therefore invaluable for making clinical decisions and informing policy. However, most reviews are static publications, which can become quickly out of date as new primary research is published. For the reader, it is currently impossible to determine whether any particular systematic review is up to date, or whether new important research was published after the searches were conducted. For authors, it is unclear whether it is worth the effort of updating their review, given uncertainty about whether new evidence exists which might change their conclusions. For commissioners and policy makers, it is unclear when and whether to fund updates of systematic reviews.
As an example, consider the topic of COVID-19 treatments or vaccines. New studies are being rapidly conducted and published on these topics. A “static” systematic review on either topic, with a search date of 6 months ago (from the time of this writing), is likely to have missed critical new findings, and failed to provide an account of the current science. Given the pace of new published trial evidence in COVID-19, a conventional systematic review would likely become outdated before it was ever published.
Living systematic reviews have been proposed as one model for keeping rigorous syntheses current with evolving evidence. The idea is to update syntheses as new evidence emerges, ideally with low latency. For COVID-19 specifically, a number of living reviews are currently being maintained on both treatments and vaccines. To date, living systematic reviews have been achieved by repeating a conventional systematic review update on a frequent basis (updating searches, say, monthly or weekly), screening the results, and extracting data. This process still depends on review teams having to actively run searches and find new studies (a “pull” model), and will result in some lag between manual search and identification of relevant studies. In addition, conventional database searching can yield large numbers of abstracts which require screening.
Conducting the search and screening the results to identify potentially relevant abstracts accounts for a large proportion of the work of a systematic review. The findings of this work (whether or not new studies are identified) are important to readers and policy makers. The main mechanism for providing this information to users is to publish a “full” update. This process, particularly for “empty updates,” is time consuming. There is a need to identify new evidence relevant to existing systematic reviews in a more efficient and less manual way. In addition, there is a need for a formal way to represent the currency of existing systematic reviews, based on whether all relevant evidence has been incorporated.
There has been much recent research attention on how to use artificial intelligence (AI) systems to automate (or semi-automate: where AI systems are combined with human experts) living updates. The most advanced technology in this respect is the use of machine learning (ML) to prioritize studies for screening, which has been found to be accurate and efficient in a number of methodological studies.
Here, we describe a hybrid system that integrates ML and natural language processing (NLP) methods with human expertise to translate static systematic reviews into living reviews. The system automatically monitors research databases for new, relevant research to a systematic review, and notifies the review authors. This “push” model differs fundamentally from the standard approach to updating reviews, which depends on review authors taking the initiative to periodically search for newly published evidence. We present a formative evaluation of the system, comparing the reliability of (semi-)automatic systematic review updates prospectively with traditional manual update searches for an ongoing systematic review of COVID-19 vaccine efficacy. The system—a collection of trained models and a prototype web interface with which to interact with them—is free and open-source. This work constitutes a step toward translating the idea of living reviews into practice.
Before using RobotReviewer LIVE, users develop a review question, and manually (via baseline full systematic manual search and manual screen of abstracts) identify baseline included studies. RobotReviewer LIVE can be used prospectively with new reviews (to replace the need for update searches before publication), or to bring existing reviews up to date—the only requirement to use RobotReviewer LIVE is the availability of the baseline search results and baseline included abstracts.
To start the process, the user registers their review on RobotReviewer LIVE using the user interface shown in Fig. 1. The system then surveils the medical literature for newly published randomized controlled trials (RCTs) likely to meet the inclusion criteria of the systematic review. Topic experts screen the matching abstracts as and when they are identified, and their inclusion decisions produce live status updates of the systematic review. We illustrate the steps of the updating process in Fig. 2, and describe each in detail below. These steps are run continuously after publication of the initial review.
2.1 Finding new clinical trial reports
As a first step, we monitor PubMed daily for new clinical trial reports via the Trialstreamer database. Trialstreamer identifies all reports of RCTs via a validated ML model (recall 0.97, precision 0.52). Abstracts describing RCTs then go on to detailed automatic extraction of trial characteristics (descriptors of participants, interventions, and outcomes), sample sizes, author conclusions, and indicators of methodological bias. We have described the NLP data extraction methods used in Trialstreamer in detail previously.
In this step, we move from the set of all RCTs (>750,000 at the time of writing) to a topic-focused (but highly sensitive) set for subsequent machine classification. This topic-focused set of RCTs is created by selecting broad topic terms relevant to the question of the systematic review. The available terms are derived from the MeSH vocabulary alongside an indicator of whether the term describes the Population, Interventions, or Outcomes of the trial. The RobotReviewer LIVE interface allows reviewers to select relevant terms using an autocompleting text box.
The topic-focused set might only include articles that mention a particular condition of interest. The topic-focused set is assumed to be much broader than the set of studies that would be retrieved with a conventional search for a specific systematic review question. In the case of the COVID-19 vaccines review, we included articles in the topic-focused set only if the abstracts contained a mention of COVID-19 or a synonym from a synonym list we generated automatically by minimally processing terms from the Unified Medical Language System Metathesaurus. We have described the method for generating these terms previously.
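The topic filter described above can be illustrated with a minimal sketch. It assumes a small hand-written synonym list and toy abstracts; the actual system derives its synonyms from the Unified Medical Language System Metathesaurus and matches against structured terms in Trialstreamer, so the names `COVID_SYNONYMS`, `build_topic_pattern`, and `topic_filter`, as well as the record identifiers, are hypothetical.

```python
import re

# Hypothetical synonym list for illustration only; the system described here
# generates its list automatically from the UMLS Metathesaurus.
COVID_SYNONYMS = [
    "covid-19",
    "covid 19",
    "sars-cov-2",
    "2019-ncov",
    "coronavirus disease 2019",
]


def build_topic_pattern(synonyms):
    """Compile one case-insensitive pattern that matches any synonym as a whole term."""
    # Longest synonyms first, so longer phrases are preferred over their prefixes.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def topic_filter(abstracts, pattern):
    """Keep every abstract that mentions at least one synonym (broad, recall-oriented)."""
    return [a for a in abstracts if pattern.search(a["text"])]


# Toy records with placeholder identifiers (not real PubMed IDs).
abstracts = [
    {"id": "1", "text": "A randomized trial of an mRNA vaccine against SARS-CoV-2."},
    {"id": "2", "text": "Statin therapy after myocardial infarction: an RCT."},
    {"id": "3", "text": "Efficacy of a COVID-19 vaccine in adults."},
]

pattern = build_topic_pattern(COVID_SYNONYMS)
topic_set = topic_filter(abstracts, pattern)
print([a["id"] for a in topic_set])  # prints ['1', '3']
```

The filter is deliberately permissive: it only removes articles that never mention the condition, leaving finer relevance decisions to the later classification and human screening steps.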
Our goal in this step is to automatically filter out the vast majority of irrelevant articles. We have found previously that ML models with sufficient recall for systematic reviews (which aim to retrieve all research fulfilling their inclusion criteria) will, even in the best case, retrieve a high fraction of false positives. We therefore aim to develop a model with near 100% recall, but add a later screening step by a human expert to remove false positives. A lower precision is therefore acceptable so long as the volume of articles for manual screening is manageable. To achieve this, we train a classification layer on top of “BERT”-based representations of input articles. BERT (Bidirectional Encoder Representations from Transformers) is a multilayer neural network language model, which is “pretrained” using a large volume of unlabeled plain text documents (e.g., the full contents of Wikipedia, and large collections of books freely available on the internet). Here specifically, we use the BERT variant BioMed-RoBERTa, which is optimized for scientific research articles by conducting the pretraining on a large collection of scientific articles obtained from Semantic Scholar. We make use of the human inclusion and exclusion decisions from the original systematic review to train this model.
In the task of abstract screening for systematic reviews, there are typically far fewer relevant than irrelevant citations (i.e., most candidate articles retrieved via search will not meet review eligibility criteria). This creates class imbalance in the training set, which can in turn result in poor model sensitivity, because overall predictive loss can be largely minimized by predicting that all instances belong to the majority class (i.e., all abstracts are irrelevant). Following prior work on methods for achieving a better balance between sensitivity and specificity in imbalanced scenarios, we resample the data to induce a balanced distribution during model training. We construct balanced batches during Stochastic Gradient Descent (SGD) using weighted sampling, such that minority examples (relevant citations) are assigned weights inversely proportional to the prevalence of the minority class.
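The weighted sampling idea can be sketched in a few lines of standard-library Python. This is an illustration rather than the project's training code: it assumes each example is weighted by the inverse frequency of its own class (which gives relevant citations a weight inversely proportional to their prevalence), and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_weights(labels):
    """Weight each example by the inverse frequency of its class."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_batch(examples, labels, batch_size, rng):
    """Draw one batch with replacement so that each class is equally likely overall."""
    weights = balanced_weights(labels)
    indices = rng.choices(range(len(examples)), weights=weights, k=batch_size)
    return [(examples[i], labels[i]) for i in indices]


# Toy data: 5 relevant citations (label 1) among 100, i.e. 5% prevalence.
labels = [1] * 5 + [0] * 95
examples = [f"abstract_{i}" for i in range(len(labels))]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = sample_batch(examples, labels, batch_size=1000, rng=rng)
share_relevant = sum(y for _, y in batch) / len(batch)
print(f"share of relevant citations in batch: {share_relevant:.2f}")
```

Although relevant citations make up only 5% of the toy dataset, they account for roughly half of each sampled batch, so the loss can no longer be minimized by predicting the majority class everywhere.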
We trained our BioMed-RoBERTa model for five epochs using SGD with a learning rate of 10⁻³ and a momentum of 0.9, yielding a final model that recalled 100% of relevant articles with 40% precision (Area Under the Receiver Operating Characteristic curve 0.97) when evaluated on 10% of the dataset which was held out from training. Our model code is available on our project GitHub page (https://github.com/bwallace/RobotScreen/).
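To make the two optimizer settings concrete, the sketch below applies one common formulation of SGD with momentum to a toy one-dimensional problem. It is an assumption that this particular variant matches the one used in training; the function name and the quadratic objective are purely illustrative.

```python
def sgd_momentum_step(param, grad, velocity, lr=1e-3, momentum=0.9):
    """One update of SGD with momentum (velocity accumulates past gradients)."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x, v = 1.0, 0.0
for _ in range(1000):
    grad = 2.0 * x
    x, v = sgd_momentum_step(x, grad, v)

print(abs(x) < 1e-6)  # prints True: x has moved very close to the minimum at 0
```

With a momentum of 0.9, each step also carries forward 90% of the previous update direction, which smooths noisy gradients and speeds up progress along consistent directions.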
2.4 Validation of results by systematic review authors
If the steps above retrieve new potentially relevant articles, systematic review authors are notified by email and invited to screen new abstracts for relevance. This step aims to remove any false positives (i.e., ineligible articles deemed relevant by the model). Although conventional systematic review searches might include hundreds or thousands of articles for manual review, the automatic system (in Step 3) aims to remove the vast majority of these articles. In the case of our example topic (which was subject to particularly high rates of research and publication during the study period), the system identified on average three potentially eligible abstracts per week which were then pushed to the review’s lead author. Review authors can screen the new studies by signing on to the website (Fig. 3, Fig. 4).
2.5 Publication of live status update
We automatically publish a live update, which makes use of the latest information from both the automated and manual evidence screening (see example in Fig. 2). This text is designed to be displayed as an additional section in the structured abstract, with the header “Automatic updates.” We display the full abstract including the live update section on our website, and also make this available via a REST API so that external journal publishers could opt to display a live, updated version of the abstract as part of the primary research article in future.
We provide meta-data about new studies (including numbers of studies screened, how many were deemed relevant by the topic expert, and numbers of trial participants). This numerical meta-data is collected from our screening records, and from the structured data in the Trialstreamer database (which has been automatically extracted using NLP models), and displayed following a template.
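A templated status message of this kind might be produced as in the following sketch. The field names, the wording of the template, and the numbers are all hypothetical and are not taken from the RobotReviewer LIVE code or from the study results; only the “Automatic updates” header comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class LiveStatus:
    """Hypothetical container for the numerical meta-data of one live update."""
    last_checked: str    # date of the most recent automated search
    n_screened: int      # abstracts screened by the topic expert
    n_included: int      # abstracts the expert judged relevant
    n_participants: int  # participants across the newly included trials


def render_live_update(status: LiveStatus) -> str:
    """Fill a fixed text template with the latest screening numbers."""
    if status.n_screened == 0:
        return (
            f"Automatic updates: as of {status.last_checked}, "
            "no new potentially relevant trials have been identified."
        )
    return (
        f"Automatic updates: as of {status.last_checked}, "
        f"{status.n_included} of {status.n_screened} newly screened abstracts "
        f"met the inclusion criteria ({status.n_participants} participants in total)."
    )


# Illustrative values only, not results from the study.
status = LiveStatus("2021-08-01", n_screened=10, n_included=2, n_participants=500)
text = render_live_update(status)
print(text)
```

Because the text is generated from structured counts, the same record could be rendered on a website or returned by an API without any manual rewriting.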
As part of this step, we have also explored the use of automatic narrative summaries of newly included studies. We aimed to produce a brief summary of the new studies’ findings to be presented alongside the templated meta-data described in the main paper. We provide further details about this method and results in the Appendix.
2.6 Evaluation: prospective case study with a COVID-19 vaccination review
We evaluated the system prospectively in comparison to a conventional manually updated living systematic review on COVID-19 vaccination evidence. The baseline full systematic manual searches for this review were completed and screened on February 9, 2021.
We ran our comparative evaluation from February 9, 2021 to August 1, 2021. During this period, the review authors performed conventional manual update searches, and we ran the semi-automated system in parallel. We calculated recall with respect to the combined set of included articles from the manual and automatic update systems. Screening of the abstracts found by RobotReviewer LIVE was done by an independent member of the review team, who was not involved in the screening of the manual update searches.
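The recall definition used here, in which the reference set is the union of everything included by either approach, can be written as a short sketch. The identifiers below are toy placeholders and the function name is hypothetical.

```python
def recall_against_union(found, *included_sets):
    """Recall of `found` relative to the union of all included sets."""
    reference = set().union(*included_sets)
    if not reference:
        raise ValueError("reference set is empty; recall is undefined")
    return len(set(found) & reference) / len(reference)


# Toy identifiers: the automated arm includes one study the manual arm missed.
manual_included = {"a", "b", "c"}
automated_included = {"a", "b", "c", "d"}

manual_recall = recall_against_union(
    manual_included, manual_included, automated_included
)
automated_recall = recall_against_union(
    automated_included, manual_included, automated_included
)
print(manual_recall, automated_recall)  # prints 0.75 1.0
```

Using the union as the denominator means neither approach is treated as a perfect gold standard: a study found by only one arm lowers the recall of the other.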
Because of the time taken to screen abstracts in the manual update, the last manual update search during the evaluation period was run on July 1, 2021. The “push” model used by our automated system, in which smaller numbers of abstracts were sent for screening on the day of publication, meant that there was almost no lag between abstract publication and screening, and the live status updates included abstracts published up to and including August 1, 2021. To allow a fair comparison, we present results separately for the period until July 1 (a direct comparison of automatic update search performance vs. conventional manual update searches at intervals), and for July 1, 2021 to August 1, 2021 (which evaluates any advantage in screening efficiency with automation).
We present a Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram comparing the screening approaches in Fig. 5. The baseline (manual) version of the review search was conducted in February 2021. This yielded 4,493 abstracts, of which 38 both met eligibility criteria and were reports of RCTs.
Manual update searches retrieved 135 abstracts; in contrast, the automated system retrieved 56. Both strategies resulted in the same 31 included abstracts after screening.
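The precision figures reported for the two strategies follow directly from these counts, as the short calculation below shows. The relative reduction in screening volume is derived here from the same numbers and is not separately reported in the text.

```python
def precision(included, retrieved):
    """Fraction of retrieved abstracts that were included after screening."""
    return included / retrieved


manual_precision = precision(31, 135)     # manual update searches
automated_precision = precision(31, 56)   # automated system
screening_reduction = 1 - 56 / 135        # fewer abstracts to screen

print(f"{manual_precision:.0%} {automated_precision:.0%} {screening_reduction:.0%}")
# prints 23% 55% 59%
```

Both approaches include the same 31 abstracts, so the gain from automation in this comparison is entirely in the number of abstracts a reviewer has to read.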
We have presented a system for identifying new evidence to include in systematic reviews, and for producing live abstract updates on the currency of systematic reviews. RobotReviewer LIVE combines AI (ML/NLP) with human expertise, and allows new studies to be incorporated in published review reports quickly after publication. We have made the software, ML models, and data needed to implement the system freely available as open-source software. We also provide a prototype of RobotReviewer LIVE that features a simple user interface, which should allow systematic review authors to produce live updates for their existing “static” systematic reviews. This prototype is also available as open-source code.
We provide an easy-to-use interface to allow experts to validate the automatic search results—potentially providing substantial efficiencies in the updating process, while still providing the assurances afforded by expert verification. In practice, converting a new conventional systematic review to a “living” equivalent using the system could be done in a matter of minutes. We make the technology available as open source, together with a REST API to enable live updates to be used inline in published journal articles, embedded in the websites of third party publishers. Even where a review is not actively kept up to date, this may allow interested individuals to see estimates of the amount of relevant evidence published since the time said review was completed. In the future, this platform may also permit “crowdsourced” maintenance of systematic reviews.
Related systems have been developed and evaluated, notably the Cochrane “Evidence Pipeline” and Centralised Search Service. These projects also monitor research databases (using a combination of ML identification of RCTs and crowdsourcing) and notify Cochrane review groups (which each typically manage tens of systematic reviews on a common clinical theme) when new research is published relevant to their theme. In contrast, our system is designed to manage updates for individual systematic reviews.
In our prospective case study, the automated method identified all of the includable abstracts found manually. We continued to run the automated system for an additional month after the evaluation period (until August 1), because the review team conducted the manual update search earlier than we had expected. In this month, the automated system found 12 additional abstracts which were deemed includable. This illustrates the advantage of the low latency “push” screening model, especially for topics such as COVID-19 vaccination, with rapid publication rates.
One previous criticism of systematic review automation tools is that they are often found as discrete, scattered pieces of academic code which require substantial technical expertise to use in practice. To overcome this problem, we have produced an easy-to-use web interface which should allow users to create a “living” version of a systematic review with minimal effort (Fig. 3, Fig. 4).
This technology is still emerging, and users should be aware of important limitations. Although the performance on this case study is strong, the evaluation review is ideal for such technology. The review question is precise, and concerns a well-defined intervention and health condition, both of which are easy to capture in the structured vocabularies used in the Trialstreamer database. In the midst of a pandemic, there are also large numbers of eligible studies being published (whereas precision is likely to reduce in any search as the prevalence of eligible studies decreases—no matter whether manual or automated). We have presented a single case study, and it is likely that performance will vary particularly for more complex reviews.
Currently, we make use of the Trialstreamer database, which at present is limited to articles describing RCTs. We intend to make additional article types available in future; at present the system is limited to systematic reviews of intervention trials because of the data sources used. We also make use of articles from PubMed only; we are unable to access additional proprietary databases such as EMBASE, a restriction which might (modestly) harm the recall of the system. Overall, although the individual components of the system have been extensively validated, this report describes the only validation using a conventional systematic review as a comparator. The reliability of the system in general (particularly for reviews that deviate substantially from the format of the current evaluation) requires further study.
Manually updating systematic reviews is time consuming and laborious, meaning many conventionally produced reviews become quickly out of date. We hope that further evaluation and development of the ideas and methods presented here will bring the goal of dynamic publication of live evidence synthesis updates a step closer into practice.
CRediT authorship contribution statement
Iain J Marshall: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review and editing, Supervision, Project administration, and Funding acquisition. Thomas A. Trikalinos: Conceptualization, Methodology, Validation, Writing – review and editing, Supervision, and Funding acquisition. Frank Soboczenski: Conceptualization, Methodology, Software, Validation, and Writing – review and editing. Hye Sun Yun: Methodology, Software, and Writing – review and editing. Gregory Kell: Methodology, Software, and Writing – review and editing. Rachel Marshall: Conceptualization, Methodology, and Writing – review and editing. Byron C. Wallace: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review and editing, Supervision, Project administration, and Funding acquisition.
Funding Statement: This work has been supported by the National Institutes of Health under the National Library of Medicine grant R01-LM012086 and by the National Science Foundation under grant 1750978: “CAREER: Structured Scientific Evidence Extraction: Models and Corpora.” The work has also been partially supported by the UK Medical Research Council, through its Skills Development Fellowship program, fellowship MR/N015185/1.
Conflict of interest: The authors declare no conflicts of interest.