Text-mining in electronic healthcare records can be used as efficient tool for screening and data collection in cardiovascular trials: a multicenter validation study
Corresponding author. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Universiteitsweg 100, 3584 CG Utrecht, the Netherlands. Tel.: +31 (0)6 12 43 45 58.
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Cardiology, Meander Medical Center, Amersfoort, the Netherlands; Department of Cardiology, Division Heart & Lungs, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Medical Humanities, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Cardiology, Division Heart & Lungs, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; Institute of Cardiovascular Science, Faculty of Population Health Sciences, University College London, London, United Kingdom; Health Data Research UK and Institute of Health Informatics, University College London, London, United Kingdom
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; Department of Cardiology, Meander Medical Center, Amersfoort, the Netherlands; Dutch Network for Cardiovascular Research (WCN), Utrecht, the Netherlands
This study aimed to validate trial patient eligibility screening and baseline data collection using text-mining in electronic healthcare records (EHRs), comparing the results to those of an international trial.
Study Design and Setting
In three medical centers with different EHR vendors, EHR-based text-mining was used to automatically screen patients for trial eligibility and extract baseline data on nineteen characteristics. First, the yield of screening with an automated EHR text-mining search was compared with manual screening by research personnel. Second, the accuracy of baseline data extracted by EHR text-mining was compared to manual data entry by research personnel.
Results
Of the 92,466 patients visiting the out-patient cardiology departments, 568 (0.6%) were enrolled in the trial during its recruitment period using manual screening methods. Automated EHR data screening of all patients showed that the number of patients needed to screen could be reduced by 73,863 (79.9%). The remaining 18,603 (20.1%) contained 458 of the actual participants (82.4% of participants).
In trial participants, automated EHR text-mining missed a median of 2.8% (interquartile range [IQR] across all variables 0.4–8.5%) of all data points compared to manually collected data. The overall accuracy of automatically extracted data was 88.0% (IQR 84.7–92.8%).
Conclusion
Automatically extracting data from EHRs using text-mining can be used to identify trial participants and to collect baseline information.
Compared to conventional methods, automated text-mining in electronic healthcare records (EHRs) can substantially reduce the number of patients that need to be screened for trial enrollment and the amount of labor to collect data.
What this adds to what was known
• Previous studies mining parts of EHRs showed mixed results for text-mining methods and mainly focussed on observational and registry data-collection.
• This study shows that integral text-mining of EHRs yields good results for trial participant screening and data-collection.
What is the implication and what should change now
• Clinical trials should consider automated text-mining methods to supplement current participant screening and data-collection methods.
1. Introduction
Clinical research requires highly detailed information on large numbers of subjects, often acquired by many investigators and supporting staff. In particular, prospective research such as registries and randomized clinical trials (RCTs) needs to comply with high standards of data validity. Scientific and regulatory requirements make such endeavors laborious and increase costs to a level only large companies are able to meet.
Cardiovascular outcome trials with moderate to low absolute risks nowadays require over 10,000 participants and are estimated to cost between 35,000 and 45,000 US dollars per participant, with total costs up to half a billion US dollars for conduct. Standing practice for clinical trials is that dedicated personnel enter source data in distinct (electronic) case report forms (CRFs). These data, however, are generally already collected in clinical care and available in electronic healthcare records (EHRs), so that manual entry creates overlapping copies of data that already exist (Fig. 1A).
Fig. 1. Layers of data collected during trials (left: required trial data collection when not using EHR data in perspective to data available in EHR; right: (theoretical) required trial data collection when using EHR data).
Conventional data collection methods generally involve retrieving information through researcher-patient interviews and manual data extraction. After retrieval, data are entered manually in electronic data capture (EDC) systems as part of CRFs. Data quality is guaranteed up to a certain level by automated control processes and internal and external monitoring. If EHR data are to be used to identify participants or as an alternative data source, these data should be of sufficient quality. High data quality is paramount, yet the required level will differ per objective and is relative to the nature of the data. Outcome data that are used to estimate a treatment effect require higher fidelity than baseline data.
We hypothesized that patients eligible for trial participation can be effectively identified based on information already present in EHRs using automated text-mining. Second, we hypothesized that the majority of data collected for the purpose of the trial is also already available in EHRs. If these data can be extracted automatically with acceptable accuracy, the extensive manual entry by investigators in EDCs could be reduced. Data collection efforts could then focus on information not available from EHRs, reducing the manual EHR-to-EDC data duplication that is now common (Fig. 1).
2. Methods
This study was a multicenter, multi-EHR-vendor validation study to assess the accuracy of automated EHR text-mining for trial participant screening and baseline data collection. As a reference standard, we used manual participant screening and data collection by manual data entry in EDCs, which is the current standard for most RCTs.
First, all patients who visited the outpatient cardiology clinics of the three participating medical centers during the recruitment phase (October 1, 2016, to December 1, 2018) of the LoDoCo2 trial were automatically and anonymously screened retrospectively for eligibility of participation in the trial according to its inclusion and exclusion criteria (Fig. 2). The yield of eligible patients via this method was compared to those actually included in the trial by manual screening for trial participation. Second, baseline characteristics were automatically collected for all trial participants, and accuracy was assessed against manually collected data.
Fig. 2. Overview of the process of conventional and automated participant identification and data collection and the associated estimated time of these processes.
Conventional participant identification and data collection methods used in the international clinical trial LoDoCo2 were used as the reference standard. The LoDoCo2 trial was chosen as it represents a prototype large international multicenter cardiovascular outcome trial.
In short, the LoDoCo2 trial was a randomized, investigator-initiated, international, multicenter study that investigated whether colchicine 0.5 mg once daily, as compared to placebo, reduces the incidence of major adverse cardiovascular events in patients with stable coronary artery disease.
This study was based on the data of patients visiting the cardiology outpatient clinics of three large Dutch medical centers. The medical centers were selected to represent the major EHR software vendors in the Netherlands (Epic [Hospital A], ChipSoft [Hospital B], CSC Care solutions [Hospital C]; cumulatively used in 80% of the Dutch hospitals and almost 10% of the hospitals worldwide).
Participants of the LoDoCo2 trial were retrieved on their trial identification number and unique on-site identifier as recorded in their EHR files. Participants for whom no trial identifiers were reported in the EHR were ignored since they could not be linked to the CRF data functioning as the reference standard.
2.3 Participant identification methods
2.3.1 Automatic, using text-mining from EHRs
A Boolean retrieval query to obtain the required data was developed in adherence with the eligibility criteria of the LoDoCo2 trial by two authors (WBvD and ATLF) (Supplement 1a). The query was developed with a graphical user interface data mining tool with text-mining features (CTcue, version 2.0.12; Amsterdam, The Netherlands). This data mining tool integrally searched structured and unstructured EHR data (including clinical letters, in-hospital consultations, procedures, diagnostic tests, and drug prescriptions).
Both authors who developed the query were considered to have content expertise from their medical backgrounds and had extensive experience in query development. Additionally, one of these authors (ATLF) was also a lead investigator of the LoDoCo2 trial.
The query consisted of regular expressions of the eligibility criteria as given by the LoDoCo2 trial, their synonyms, and negations (e.g., “no hypertension” instead of “hypertension”). Synonyms were added using the automatic synonym expander built into the data mining tool and supplemented with synonyms and abbreviations commonly used by the query developing authors (Supplement 1a).
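The combination of term, synonym, and negation patterns described above can be illustrated with a minimal sketch. It is not the query from Supplement 1a and does not reflect the internals of the data mining tool; the synonym list, the negation cues, the two-word negation window, and the function name `mentions_condition` are all hypothetical, and the example is in English purely for readability.

```python
import re

# Hypothetical synonyms for a single criterion (illustration only).
SYNONYMS = ["hypertension", "high blood pressure", "HT"]

# Hypothetical negation cues that may precede a mention.
NEGATIONS = ["no", "not", "without", "denies"]

TERM = "(?:" + "|".join(re.escape(s) for s in SYNONYMS) + ")"
NEG = "(?:" + "|".join(re.escape(n) for n in NEGATIONS) + ")"

# A mention counts as negated when a cue appears at most two words before it.
NEGATED = re.compile(rf"\b{NEG}\b(?:\s+\w+){{0,2}}?\s+{TERM}\b", re.IGNORECASE)
AFFIRMED = re.compile(rf"\b{TERM}\b", re.IGNORECASE)


def mentions_condition(text: str) -> bool:
    """Return True if the text contains at least one non-negated mention."""
    # Blank out negated mentions first, then look for any remaining mention.
    remaining = NEGATED.sub(" ", text)
    return AFFIRMED.search(remaining) is not None


print(mentions_condition("Known hypertension since 2010"))           # True
print(mentions_condition("No known hypertension"))                   # False
print(mentions_condition("No diabetes. HT treated with amlodipine."))  # True
```

A fixed word window is a crude heuristic: a phrase such as "no diabetes but hypertension" would be treated as negated, which is one reason content expertise remains necessary when such patterns are written.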
To preclude automatic retrieval of information entered in the EHR after trial participation, only data registered in EHRs prior to the screening of the trial were used. No site-specific optimizations were added to the query, except for the retrieval of trial participants and periprocedural drug recognition adjustments. To approximate data collection as would have been performed in the trial, the most recent status on any data point before entering the trial was taken. Additionally, drug use data were limited to data registered within a year of enrollment. When no measurement of a variable was found, it was assumed to be absent for the participant.
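The selection rules in this paragraph (only entries registered before screening, the most recent status wins, a one-year window for drug use, and absence when nothing is found) can be sketched as follows. The `Record` structure, the function name `status_at_screening`, and the use of the screening date as the reference point for the one-year window are assumptions made for illustration; the study's actual data model is not described at this level.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Record:
    variable: str      # e.g. "diabetes" or "statin" (hypothetical names)
    present: bool      # status recorded in the EHR entry
    recorded_on: date  # date the entry was registered


def status_at_screening(
    records: List[Record],
    variable: str,
    screening_date: date,
    max_age_days: Optional[int] = None,
) -> bool:
    """Most recent status of `variable` registered before `screening_date`.

    If `max_age_days` is given (e.g. 365 for drug use), older entries are
    ignored. When no usable entry exists, the variable is assumed absent.
    """
    candidates = [
        r for r in records
        if r.variable == variable
        and r.recorded_on < screening_date
        and (max_age_days is None
             or screening_date - r.recorded_on <= timedelta(days=max_age_days))
    ]
    if not candidates:
        return False  # no measurement found: assumed absent
    return max(candidates, key=lambda r: r.recorded_on).present


records = [
    Record("diabetes", True, date(2015, 1, 1)),
    Record("diabetes", False, date(2016, 6, 1)),
    Record("diabetes", True, date(2017, 3, 1)),   # after screening: ignored
    Record("statin", True, date(2014, 1, 1)),     # older than one year
]
screening = date(2017, 1, 1)
print(status_at_screening(records, "diabetes", screening))                  # False
print(status_at_screening(records, "statin", screening, max_age_days=365))  # False
```

Treating "not found" as "absent" keeps every variable binary, but it also means that unrecorded characteristics surface as false negatives rather than as missing values.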
2.3.2 Manual participant identification, as used in the LoDoCo2 (reference standard)
Trial investigators of the LoDoCo2 trial used two steps to identify trial participants. First, manual screening was performed for eligibility using the EHR files prior to their outpatient clinic visit. Second, patients were interviewed face-to-face to verify eligibility and ask for participation. After providing informed consent, participation in the trial ensued.
2.4 Baseline data extraction methods
2.4.1 Automatic, using text-mining from EHRs
A query was developed to automatically collect data from the EHRs on nineteen variables, which contained information about demography, medical history, procedure history, and drug use as reported in the baseline table of the trial’s methods paper (Supplement 1b). For the development of this query the same methods were employed as used for the participant identification query.
2.4.2 Conventional data extraction, as used in the LoDoCo2 trial (reference standard)
In the LoDoCo2 trial, data were collected manually during face-to-face baseline interviews with participants at trial enrollment. Interview data were first recorded as source data on-site and afterward entered into the trial’s EDC system.
2.5 Analysis
2.5.1 Participant identification efficiency
For each site, the number of unique patients visiting during the trial recruitment period and the number of patients automatically identified as potentially eligible were recorded and compared to the number of patients enrolled in the actual trial. For both methods, a theoretical yield was calculated based on the patients needed to screen for identification. For determining the yield of the automatic participant identification, the number of enrolled trial participants was used as a proxy, as it was not possible to assess retrospectively how many of the automatically identified potentially eligible patients would have been enrolled.
2.5.2 Data collection accuracy
Results of automated EHR text-mining were compared to manually collected trial data on their distributions and accuracy (defined as [true positive data points + true negative data points]/all data points) on an individual patient level. For clarity and to show agreement between EHR vendors, accuracies of the various medical centers were plotted against the overall accuracy in a forest plot.
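The accuracy definition above can be made concrete with a short sketch that treats the manually collected trial data as the reference standard. The function names and the five example data points are hypothetical and serve only to show how each data point is classified.

```python
from typing import Sequence, Tuple


def confusion_counts(ehr: Sequence[bool], trial: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, FN, TN) with the trial data as the reference standard."""
    if len(ehr) != len(trial):
        raise ValueError("both sources must cover the same data points")
    tp = sum(1 for e, t in zip(ehr, trial) if e and t)
    fp = sum(1 for e, t in zip(ehr, trial) if e and not t)
    fn = sum(1 for e, t in zip(ehr, trial) if not e and t)
    tn = sum(1 for e, t in zip(ehr, trial) if not e and not t)
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """(true positives + true negatives) / all data points."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data points for one variable across five participants.
ehr_values = [True, True, False, False, True]
trial_values = [True, False, False, True, True]

counts = confusion_counts(ehr_values, trial_values)
print(counts)             # (2, 1, 1, 1)
print(accuracy(*counts))  # 0.6
```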
3. Results
3.1 Participant identification efficiency
A total of 92,466 patients visited the cardiology outpatient clinic of the three study centers during the recruitment period of the LoDoCo2 trial (October 1, 2016 to December 1, 2018). Of these, 568 patients (0.6%) were enrolled in the LoDoCo2 trial (Fig. 3, Table 1).
Fig. 3. Eligible patients identified with conventional and automated participant identification.
For the LoDoCo2 trial, all patients visiting the cardiology out-patient clinics were screened for trial eligibility. Automated EHR data screening reduced the number of patients that needed to be screened for trial participation by 73,863 (79.9%). The remaining 18,603 (20.1%) contained 458 of the actual trial participants (82.4% of participants). Further inspection of the 110 (17.6%) trial participants missed by the data mining tool showed that data on one or more inclusion or exclusion criteria were missing from the automatically retrieved data (no proof of coronary artery disease [found as a coronary angiography, CT coronary angiography, or Coronary Artery Calcium Score]: n = 38; no known renal function: n = 41; date of previous Coronary Artery Bypass unknown: n = 41). Characteristics of missed participants did not differ substantially from identified participants (median difference of all variables 1.6%, IQR 3.1%); values were therefore assumed to be missing at random.
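The screening reduction reported above follows directly from the counts in the text, as the sketch below shows. The "patients needed to screen per enrolled participant" figure is one possible reading of the theoretical yield and is an assumption of this example, not a value reported by the study; as in the analysis plan, enrolled participants are used as a proxy for the automated method.

```python
visiting = 92_466        # patients visiting the outpatient cardiology clinics
flagged = 18_603         # patients remaining after automated EHR screening
enrolled = 568           # participants enrolled through manual screening
enrolled_flagged = 458   # enrolled participants among the flagged patients


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


reduction = visiting - flagged
print(reduction)                  # 73863
print(pct(reduction, visiting))   # 79.9
print(pct(flagged, visiting))     # 20.1
print(pct(enrolled, visiting))    # 0.6

# Assumed reading of "theoretical yield": patients screened per enrolled participant.
print(round(visiting / enrolled, 1))          # 162.8 (conventional screening)
print(round(flagged / enrolled_flagged, 1))   # 40.6 (automated pre-selection)
```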
3.2 Data collection accuracy
Of the 568 enrolled trial participants, 540 (95.1%) were automatically retrieved on their trial identification number or unique on-site identifier with the data mining tool.
On an aggregate level, availability of baseline characteristics for participants using automated EHR text-mining differed by 2.8% (median; IQR across all variables 0.4–8.5%) from manually collected trial data (Table 2; center-specific distributions are presented in Supplement 2a). Notably larger differences between automated EHR text-mining data and manually collected trial data were found for hypertension (26.2%), antiplatelet therapy (29.1%), and beta-blocker use (24.4%).
Table 2. Distributions and accuracy of baseline variables automatically collected from EHR data compared to trial data
On an individual participant level, automated EHR text-mining data showed 88.0% accuracy (median; IQR 84.7–92.8%) when compared to the conventionally collected trial data (Table 2; center-specific accuracy is presented in Supplement 2b). Overall, 9.8% of the data extracted from EHRs were false positive (i.e., data on a variable present in EHR data and not present in trial data), and 3.1% false negative (i.e., data on a variable not present in EHR data and present in trial data) (Table 3; for contingency tables of different medical centers see Supplement 2c). Of all data points, positive predictive value was 0.928, negative predictive value was 0.937, sensitivity was 0.806, specificity was 0.827, and F1-score was 0.863 (for test performance scores of individual variables, see Supplement 2d). The lowest accuracies were found for hypertension (62.6%), antiplatelet therapy (68.8%), and beta-blocker use (73.3%). Accuracies for these three variables differed between the participating medical centers, with hypertension ranging from 52.2% to 64.2%, antiplatelet therapy from 60.3% to 86.4%, and beta-blocker use from 66.4% to 84.7%.
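For readers less familiar with the test performance measures listed above, the sketch below shows the standard definitions in terms of contingency-table counts. The counts in the example are hypothetical and are not those of Table 3. The last lines only check that the reported F1-score is the harmonic mean of the reported positive predictive value and sensitivity.

```python
from typing import Dict


def performance(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard test performance measures from contingency-table counts."""
    ppv = tp / (tp + fp)            # positive predictive value (precision)
    npv = tn / (tn + fn)            # negative predictive value
    sensitivity = tp / (tp + fn)    # recall
    specificity = tn / (tn + fp)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def f1_from(ppv: float, sensitivity: float) -> float:
    """F1-score as the harmonic mean of PPV and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Hypothetical counts, for illustration only.
scores = performance(tp=40, fp=10, fn=5, tn=45)
print(scores["ppv"], scores["npv"])   # 0.8 0.9

# Consistency check on the values reported in the text.
print(round(f1_from(0.928, 0.806), 3))  # 0.863
```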
Table 3. Overall contingency table of the accuracy of collected baseline variables
4. Discussion
This study shows that it is feasible to use automated EHR text-mining to identify eligible trial participants and collect baseline data. By identifying eligible patients automatically, only 20.1% of the original 92,466 visiting patients had to be screened manually for trial inclusion. This 20.1% contained 82.4% of the participants. Data extracted from EHRs showed an average accuracy of 87.1% relative to the manually collected data of the LoDoCo2 trial.
Several studies have investigated the opportunities of using EHRs for recruitment and data collection in clinical research and trials, but only a few compare EHR data to trial data. A study from 2013 assessed the completeness of structured EHR data to trial eligibility criteria originating from multiple trials, showing that 35% of the patient characteristics derived from the eligibility criteria were available in structured EHR data at the time. Studies automatically text-mining EHRs integrally, however, reported more favorable results with accuracies comparable to those found in this study.
4.1 Implications for using EHR data in clinical research
When the quality of EHR data extraction is of an acceptable level, it could improve efficiency in trial conduct. As such, EHR data collection would allow the reallocation of resources and a reduction in execution costs.
Using automated EHR text-mining, we were able to identify patients potentially eligible for trial participation. These results are in line with the results found by previous studies. In participant recruitment, a high positive predictive value using automated EHR participant screening (i.e., most patients screened as positive also enroll in the trial) would maximize efficiency improvements. Our study indicates that automated EHR screening has the potential to identify large numbers of eligible participants in a time-efficient and cost-efficient manner (data not shown).
4.1.2 Data collection accuracy
Since baseline characteristics are often not included in the final outcome analysis, small errors in these data can be acceptable when counterbalanced by improved efficiency. Incorporating baseline characteristics measured with error in the analyses would only affect research validity when accuracy is not randomly distributed across intervention groups. If errors are random, they could still affect the precision of effect estimates after adjustment.
Accuracy of automated EHR data collection depends on the amount of missing data and measurement errors. First, variables collected from data can be missing because they were not recorded or not extracted from the data. Physicians often measure and register only what they consider relevant for delivering care. Consequently, (ordinary) characteristics that are desired in clinical research are not registered. Whether this will lead to problems in identifying patients eligible for trial participation differs per variable and context. Missing data on smoking, for example, will be less informative than missing data on coronary revascularization, since clinicians will not always ask about smoking but may be expected to document coronary interventions. These factors make it harder to extract data due to ensuing variability in how characteristics are reported. Substantive knowledge on the topics of data to be extracted is therefore still essential.
Second, EHR data could contain more measurement errors because they were not collected and measured in a standardized format, as is generally done in conventional trial data collection. EHR data can, for example, be hampered in its currency (i.e., stored variables are out of date) due to irregular visits of patients. These remain challenges of the use of EHR data that should be addressed in future research.
Third, relevant information encompassed in the EHR can still be missed due to interindividual differences in reporting or reporting errors (abbreviations, misspelling, synonyms). Improved intelligent text-pattern recognition systems might reduce the risk of missing data.
4.2 Future perspectives
EHR data collection will probably be best used in conjunction with other data collection methods instead of replacing them. Investigators can take automated and manual EHR data collection into account in the design phase of a trial. Our results show that automated EHR screening for eligible patients might result in a somewhat different study population compared to the population currently enrolled. Effects on generalizability should be considered, although the resulting patient population might well reflect a more real-world sample of participants if their characteristics differ from the original study population. The increased efficiency in identifying eligible patients might make it easier to enroll patients and, as such, to reach the desired number of inclusions faster than with conventional participant recruitment.
5. Limitations of this study
This study combined data from multiple medical centers, all using different EHR software vendors, and shows consistent results for the broad range of systems. Yet, three main limitations should be noted.
First, the accuracy of the information on hypertension, antiplatelet therapy, and beta-blockers deviated notably from collected trial data. For hypertension, the deviation between EHR and trial data was probably due to hypertension being defined as “using antihypertensive drugs” in the LoDoCo2 trial, which was hard to mirror in the EHR search query. Deviations on drug prescriptions and use variables were mainly attributed to registered timeframes of drugs and insufficient indexing of hospital drug prescription systems by the data extraction tool. Moreover, hospital physicians might not have adequately registered home prescriptions for all patients, which would also cause results on drugs to deviate.
Second, it was assumed that all patients visiting the out-patient cardiology clinics of the three hospitals were screened conventionally for participation in the LoDoCo2 trial. If this was not the case, the expected yield of automated participant identification would be overestimated in this study.
Third, our Boolean query was not enhanced with natural language processing algorithms because of the limitations of the employed data mining tool and language-specific limitations. Text-mining was, therefore, interpreted broadly as the ability to automatically extract information from unstructured texts.
6. Conclusions
Data extracted from EHRs using text-mining can be used to identify patients eligible for trial participation and for the collection of baseline characteristics. This method might substantially reduce time and costs related to recruitment and data collection in clinical trials. Whether this promise can be realized depends on whether small accuracy losses are deemed acceptable in the context of the trial that is performed. This study focused on patient eligibility screening and participant baseline data collection; future research is needed to assess the quality of outcome data in EHRs.
Acknowledgments
The authors would like to show their gratitude to Marjan van Doorn (Meander Medical Center) and Erik Badings (Deventer Hospital) for assisting with the data collection for this study.
The European medicines agency working group on clinical trials conducted outside of the EU/EEA. Reflection Paper Ethical GCP Aspects Clin Trials Med Prod Hum Use Conducted Outside EU/EEA Submitted Marketing Au.
Evaluation of data completeness in the electronic health record for the purpose of patient recruitment into clinical trials: a retrospective analysis of element presence.
The correlation between the number of eligible patients in routine clinical practice and the low recruitment level in clinical trials: a retrospective study using electronic medical records.
Funding: This work was supported by the Netherlands Organisation for Health Research and Development (ZonMW) (grant number 91217027). A. Sammani was funded by the University Medical Center Utrecht Alexandre Suerman Stipendium. Folkert Asselbergs was supported by UCL Hospitals NIHR Biomedical Research.
Conflicts of interest: Rieke van der Graaf reported being a member of an independent ethical advisory committee to Sanofi. All other authors did not report any conflicts of interest.
Author statement: 1) Conceived and designed the experiments: van Dijk, Fiolet, Schuit, Grobbee, Groenwold, Mosterd. 2) Performed the experiments: van Dijk, Fiolet, Sammani, Groenhof. 3) Analyzed and interpreted the data: van Dijk, Fiolet, Schuit. 4) Contributed reagents, materials, analysis tools or data: van der Graaf, de Vries, Alings, Schaap, Asselbergs. 5) Wrote the paper: van Dijk, Fiolet.