Text-mining in electronic healthcare records can be used as efficient tool for screening and data collection in cardiovascular trials: a multicenter validation study

Objective: This study aimed to validate trial patient eligibility screening and baseline data collection using text-mining in electronic healthcare records (EHRs), comparing the results to those of an international trial.


Introduction
Clinical research requires highly detailed information on large numbers of subjects, often acquired by many investigators and supporting staff. In particular, prospective research such as registries and randomized clinical trials (RCT) need to comply with high standards of data validity [1,2]. Scientific and regulatory requirements make such endeavors laborious and increase costs to a level only large companies are able to meet.
Cardiovascular outcome trials with moderate to low absolute risks nowadays require over 10,000 participants and are estimated to cost between 35,000 and 45,000 US dollars per participant, with total costs up to half a billion US dollars for conduct [3,4]. A major part of these costs is attributable to participant recruitment and follow-up, for a large part comprising data collection [5,6]. Standing practice for clinical trials is that dedicated personnel enters source data in distinct (electronic) clinical report forms (CRFs). This data, however, is generally already collected in clinical care and available in electronic healthcare records (EHRs), thus creating overlapping copies of data that are already available (Fig. 1A).
Automated EHR data-mining may provide a valuable method to complement or even substitute current data collection methods [7], which could save up to one-third of recruitment costs [8]. In recent years, several supervised patient-diagnosis registries with labeled clinical data emerged to improve trial efficiency [9]. The use of automatically collected EHR data in trials, however, is still very limited [10]. Conventional data collection methods generally involve retrieving information through researcher-patient interviews and manual data extraction. After retrieval, data is then entered manually in electronic data capture (EDC) systems as part of CRFs. Data quality is guaranteed up to a certain level by automated control processes and internal and external monitoring [11]. If EHR data are to be used to identify participants or as an alternative data source, these data should be of sufficient quality. High data quality is paramount, yet the required level will differ per objective and depends on the nature of the data. Outcome data that is used to estimate a treatment effect requires higher fidelity than baseline data [12].
We hypothesized that patients eligible for trial participation can be effectively identified from information already present in EHRs using automated text-mining. Second, we hypothesized that the majority of data collected for the purpose of the trial is also already available in EHRs. If extracted automatically with acceptable accuracy, the extensive manual entry by investigators in EDCs could be reduced. If true, data collection efforts could focus on information not available from EHRs and reduce manual EHR-to-EDC data duplication that is now common (Fig. 1).

Methods
This study was a multicenter, multi-EHR-vendor validation study to assess the accuracy of automated EHR text-mining for trial participant screening and baseline data collection. As a reference standard, we used manual participant screening and data collection by manual data entry in EDCs, which is the current standard for most RCTs.
First, all patients who visited the outpatient cardiology clinics of the three participating medical centers during the recruitment phase (October 1, 2016, to December 1, 2018) of the LoDoCo2 trial were automatically and anonymously screened retrospectively for eligibility of participation in the trial according to its inclusion and exclusion criteria (Fig. 2). The yield of eligible patients via this method was compared to those actually included in the trial by manual screening for trial participation. Second, baseline characteristics were automatically collected for all trial participants, and accuracy was assessed against manually collected data.

Fig. 1. Layers of data collected during trials (left: required trial data collection when not using EHR data in perspective to data available in EHR; right: (theoretical) required trial data collection when using EHR data).

Key findings
Compared to conventional methods, automated text-mining in electronic healthcare records (EHRs) can substantially reduce the number of patients that need to be screened for trial enrollment and the amount of labor to collect data.

What this adds to what was known
Previous studies mining parts of EHRs showed mixed results for text-mining methods and mainly focused on observational and registry data collection.
This study shows that integral text-mining of EHRs yields good results for trial participant screening and data collection.

What is the implication and what should change now
Clinical trials should consider automated text-mining methods to supplement current participant screening and data collection methods.

The LoDoCo2 trial
Conventional participant identification and data collection methods used in the international clinical trial LoDoCo2 were used as the reference standard. The LoDoCo2 trial was chosen as it represents a prototype large international multicenter cardiovascular outcome trial.
In short, the LoDoCo2 trial was a randomized, investigator-initiated international, multicenter study that investigated whether colchicine 0.5 mg once daily as compared to placebo in patients with stable coronary artery disease reduces the incidence of major adverse cardiovascular events [13]. The trial's recruitment started in December 2016 and was completed in December 2018. The trial methodology and results have been reported before [14].

Study population
This study was based on the data of patients visiting the cardiology outpatient clinics of three large Dutch medical centers. The medical centers were selected to represent the major EHR software vendors in the Netherlands (Epic [Hospital A], ChipSoft [Hospital B], CSC Care solutions [Hospital C]; cumulatively used in 80% of the Dutch hospitals and almost 10% of the hospitals worldwide [15,16]).
Participants of the LoDoCo2 trial were retrieved by their trial identification number and unique on-site identifier as recorded in their EHR files. Participants for whom no trial identifiers were reported in the EHR were excluded, since they could not be linked to the CRF data that served as the reference standard.

Automatic, using text-mining from EHRs
A Boolean retrieval query to obtain the required data was developed in adherence with the eligibility criteria of the LoDoCo2 trial by two authors (WBvD and ATLF) (Supplement 1a). To develop the query, a data mining tool with a graphical user interface and text-mining features was used (CTcue, version 2.0.12; Amsterdam, The Netherlands). This data mining tool integrally searched structured and unstructured EHR data (including clinical letters, in-hospital consultations, procedures, diagnostic tests, and drug prescriptions).
Both authors who developed the query were considered to have content expertise from their medical backgrounds and had extensive experience in query development. Additionally, one of these authors (ATLF) was also a lead investigator of the LoDoCo2 trial.
The query consisted of regular expressions of the eligibility criteria as given by the LoDoCo2 trial, their synonyms, and negations (e.g., "no hypertension" instead of "hypertension"). Synonyms were added using the automatic synonym expander built into the data mining tool and supplemented with synonyms and abbreviations commonly used by the query developing authors (Supplement 1a).
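As an illustration of how a regular expression can combine synonyms with negation handling, the following is a minimal sketch in Python. It is not the query used in the study (that query is given in Supplement 1a and was built in the data mining tool); the term list, the negation cues, and the function names are hypothetical and chosen only for the example.

```python
import re

# Hypothetical synonym list for a single criterion (not taken from the study).
HYPERTENSION_TERMS = ["hypertension", "high blood pressure", "HTN"]

# Hypothetical negation cues that flip the meaning of a term they precede.
NEGATION_CUES = ["no", "not", "without", "denies"]


def build_pattern(terms, negations):
    """Compile one pattern: optional negation cue, up to two filler words, then a term."""
    term_alt = "|".join(re.escape(t) for t in terms)
    neg_alt = "|".join(re.escape(n) for n in negations)
    return re.compile(
        rf"\b(?:(?P<neg>{neg_alt})\s+(?:\w+\s+){{0,2}})?(?P<term>{term_alt})\b",
        re.IGNORECASE,
    )


def find_mentions(text, pattern):
    """Return (matched term in lower case, is_negated) for every mention in the text."""
    return [
        (m.group("term").lower(), m.group("neg") is not None)
        for m in pattern.finditer(text)
    ]


pattern = build_pattern(HYPERTENSION_TERMS, NEGATION_CUES)
print(find_mentions("No known HTN; uses statin.", pattern))
print(find_mentions("History of hypertension.", pattern))
```

A fixed window of filler words is a deliberately crude way to catch phrases such as "no known hypertension"; real clinical text would need a broader set of cues and scopes.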
To preclude automatic retrieval of information entered in the EHR after trial participation, only data registered in EHRs prior to trial screening were used. No site-specific optimizations were added to the query, except for the retrieval of trial participants and periprocedural drug recognition adjustments. To approximate data collection as would have been performed in the trial, the most recent status on any data point before entering the trial was taken. Additionally, drug use data were limited to data registered within a year of enrollment. When no measurement of a variable was found, it was assumed to be absent for the participant.
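The selection rules in this paragraph can be sketched as follows. This is an illustrative reading, not the tool's implementation: the `Observation` record, its fields, and the `baseline_value` function are hypothetical, and the screening date is used here as a stand-in for the enrollment date when applying the one-year drug window.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Observation:
    """Hypothetical EHR data point for one participant."""
    variable: str
    present: bool
    recorded_on: date
    is_drug: bool = False


def baseline_value(observations, variable, screening_date, drug_window_days=365):
    """Most recent status before screening; absent (False) when nothing is found."""
    window = timedelta(days=drug_window_days)
    candidates = [
        o for o in observations
        if o.variable == variable
        and o.recorded_on < screening_date  # ignore data entered at or after screening
        and (not o.is_drug or screening_date - o.recorded_on <= window)  # drugs: last year only
    ]
    if not candidates:
        return False  # no measurement found: assumed absent
    return max(candidates, key=lambda o: o.recorded_on).present


records = [
    Observation("diabetes", False, date(2015, 3, 1)),
    Observation("diabetes", True, date(2016, 9, 12)),
    Observation("statin", True, date(2014, 1, 5), is_drug=True),
]
print(baseline_value(records, "diabetes", date(2017, 1, 10)))
print(baseline_value(records, "statin", date(2017, 1, 10)))
```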

Manual participant identification, as used in the LoDoCo2 (reference standard)
Trial investigators of the LoDoCo2 trial used two steps to identify trial participants. First, manual screening was performed for eligibility using the EHR files prior to their outpatient clinic visit. Second, patients were interviewed face-to-face to verify eligibility and ask for participation. After providing informed consent, participation in the trial ensued.

Automatic, using text-mining from EHRs
A query was developed to automatically collect data from the EHRs on nineteen variables, which contained information about demography, medical history, procedure history, and drug use as reported in the baseline table of the trial's methods paper (Supplement 1b). For the development of this query the same methods were employed as used for the participant identification query.

Conventional data extraction, as used in the LoDoCo2 trial (reference standard)
In the LoDoCo2 trial, data were collected manually during face-to-face baseline interviews at trial enrollment with participants. Interview data was first recorded as source data on-site and afterward entered into the trial's EDC system.

Participant identification efficiency
For each site, the number of unique patient visits during the trial recruitment period, number of patients automatically identified as potentially eligible, and number of patients enrolled in the trial were recorded and compared to the number of patients enrolled in the actual trial. For both methods, a theoretical yield was calculated based on the patients needed to screen for identification. For determining the yield of the automatic participant identification, the number of enrolled trial participants was used as a proxy as it was not possible to assess how many of the automatically identified potentially eligible patients would have been enrolled retrospectively.

Data collection accuracy
Results of automated EHR text-mining were compared to manually collected trial data on their distributions and accuracy (defined as [true positive data points + true negative data points]/all data points) on an individual patient level. For clarity and to show agreement between EHR vendors, accuracies of the various medical centers were plotted against the overall accuracy in a forest plot.
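For a binary variable, the accuracy defined above equals the share of participants for whom the EHR value and the trial value agree, since every agreement is either a true positive or a true negative. A minimal sketch, assuming both sources are available as equally ordered lists of booleans (the function name and the example values are hypothetical):

```python
def accuracy(ehr_values, trial_values):
    """(true positives + true negatives) / all data points, trial data as reference."""
    if len(ehr_values) != len(trial_values):
        raise ValueError("both sources must cover the same participants")
    if not ehr_values:
        raise ValueError("at least one data point is required")
    agreements = sum(e == t for e, t in zip(ehr_values, trial_values))
    return agreements / len(ehr_values)


# Hypothetical values for one variable in four participants.
ehr = [True, False, True, True]
trial = [True, False, False, True]
print(accuracy(ehr, trial))
```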

Participant identification efficiency
A total of 92,466 patients visited the cardiology outpatient clinic of the three study centers during the recruitment period of the LoDoCo2 trial (October 1, 2016 to December 1, 2018). Of these, 568 patients (0.6%) were enrolled in the LoDoCo2 trial (Fig. 3, Table 1).
For the LoDoCo2 trial, all patients visiting the cardiology outpatient clinics were screened on trial eligibility. Automated EHR data screening resulted in a reduction of 73,863 (79.9%) patients that needed to be screened for trial participation. The remaining 18,603 (20.1%) contained 458 of the actual trial participants (82.4% of participants). Further inspection of the 110 (17.6%) trial participants missed by the data mining tool showed that data on one or more inclusion or exclusion criteria were missing from the automatically retrieved data (no proof of coronary artery disease [found as a coronary angiography; CT coronary angiography or Coronary Artery Calcium Score]: n = 38; no known renal function: n = 41; date of previous Coronary Artery Bypass unknown: n = 41). Characteristics of missed participants did not differ substantially from identified participants (median difference of all variables 1.6%, IQR 3.1%); values were therefore assumed to be missing at random.
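The screening reduction follows directly from the counts reported above. The sketch below reproduces that arithmetic; the "patients screened per enrolled participant" figures are one possible reading of the theoretical yield described in the Methods and are not values reported in the text. All variable names are hypothetical.

```python
# Counts as reported in the text.
visited = 92_466          # patients visiting the outpatient clinics
flagged = 18_603          # patients flagged as potentially eligible by text-mining
enrolled_total = 568      # participants enrolled in the trial at these centers
enrolled_flagged = 458    # enrolled participants found among the flagged patients

excluded = visited - flagged
reduction = excluded / visited

# Assumed reading of "yield": patients screened per enrolled participant.
screened_per_enrolled_conventional = visited / enrolled_total
screened_per_enrolled_automated = flagged / enrolled_flagged

print(f"excluded from manual screening: {excluded} ({reduction:.1%})")
print(f"screened per enrolled, conventional: {screened_per_enrolled_conventional:.1f}")
print(f"screened per enrolled, automated: {screened_per_enrolled_automated:.1f}")
```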

Data collection accuracy
Of the 568 trial participants, 540 (95.1%) were automatically retrieved on their trial identification number or unique on-site identifier with the data mining tool.
On an aggregate level, availability of baseline characteristics for participants using automated EHR text-mining differed by 2.8% (median; IQR across all variables 0.4–8.5%) from manually collected trial data (Table 2; center-specific distributions are presented in Supplement 2a). Notably larger differences between automated EHR text-mining data and manually collected trial data were found for hypertension (26.2%), antiplatelet therapy (29.1%), and beta-blocker use (24.4%).
On an individual participant level, automated EHR text-mining data showed 88.0% accuracy (median; IQR 84.7–92.8%) when compared to the conventionally collected trial data (Table 2; center-specific accuracy is presented in Supplement 2b). Overall, 9.8% of the data extracted from EHRs were false positive (i.e., data on a variable present in EHR data and not present in trial data), and 3.1% false negative (i.e., data on a variable not present in EHR data and present in trial data) (Table 3; for contingency tables of different medical centers see Supplement 2c). Of all data points, positive predictive value was 0.928, negative predictive value was 0.937, sensitivity was 0.806, specificity was 0.827, and F1-score was 0.863 (for test performance scores of individual variables, see Supplement 2d). The lowest accuracies were found for hypertension (62.6%), antiplatelet therapy (68.8%), and beta-blocker use (73.3%). Accuracies for hypertension, antiplatelet therapy, and beta-blocker use differed between the participating medical centers, with hypertension ranging from 52.2% to 64.2%, antiplatelet therapy from 60.3% to 86.4%, and beta-blocker use from 66.4% to 84.7%.
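The test performance measures named above are all derived from a 2x2 contingency table with the trial data as reference. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not the study's contingency table (Table 3), and the class name is an assumption made for illustration. It assumes no denominator is zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 contingency counts, with manually collected trial data as reference."""
    tp: int  # present in EHR and in trial data
    fp: int  # present in EHR, not in trial data
    fn: int  # not in EHR, present in trial data
    tn: int  # absent in both

    @property
    def ppv(self):
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self):
        return self.tn / (self.tn + self.fn)

    @property
    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        return self.tn / (self.tn + self.fp)

    @property
    def f1(self):
        # Harmonic mean of PPV and sensitivity, written in count form.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    @property
    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts, for illustration only.
c = Confusion(tp=80, fp=10, fn=20, tn=90)
print(c.ppv, c.npv, c.sensitivity, c.specificity, c.f1, c.accuracy)
```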

Discussion
This study shows that it is feasible to use automated EHR text-mining to identify eligible trial participants and collect baseline data. By identifying eligible patients, only   [23]. In the same year, EHR medication lists were shown to have very broad accuracy (10–90%) [22]. Studies automatically text-mining EHRs integrally, however, reported more favorable results with accuracies comparable to those found in this study [10,24]. In addition, registries based on routinely collected data have been reported to be of high value for trial recruitment and data collection [25].

Implications for using EHR data in clinical research
When the quality of EHR data extraction is of an acceptable level, it could improve efficiency in trial conduct. As such, EHR data collection would allow the reallocation of resources and a reduction in execution costs [7].

Participant identification efficiency
Using automated EHR text-mining, we were able to identify patients potentially eligible for trial participation. These results are in line with the results found by previous studies [18,19,26]. In participant recruitment, a high positive predictive value using automated EHR participant screening (i.e., most patients screened as positive also enroll in the trial) would maximize efficiency improvements [27]. Our study indicates that automated EHR screening has the potential to identify large numbers of eligible participants in a time-efficient and cost-efficient manner (data not shown).

Data collection accuracy
Since baseline characteristics are generally not included in the final outcome analysis, small errors in these data can be acceptable when counterbalanced by improved efficiency. Incorporation of baseline characteristics measured with error in the analyses would only have an effect on research validity when accuracy is not randomly distributed across intervention groups. If random, it could affect the precision of effect estimates after adjustment [12].
Accuracy of automated EHR data collection depends on the amount of missing data and measurement errors. First, variables collected from data can be missing because they were not recorded or not extracted from the data. Physicians often measure and register only what they consider relevant for delivering care. Consequently, (ordinary) characteristics that are desired in clinical research are not registered [28]. Whether this will lead to problems in identifying patients eligible for trial participation differs per variable and context. Missing data on smoking, for example, will be of less value than missing data on coronary revascularization since clinicians will not always ask about smoking but may be expected to document coronary interventions [29]. These factors make it harder to extract data due to ensuing variability in how characteristics are reported. Substantive knowledge on the topics of data to be extracted is therefore still essential.
Second, EHR data could contain more measurement errors because they were not collected and measured in a standardized format, as is generally done in conventional trial data collection. EHR data can, for example, be hampered in its currency (i.e., stored variables are out of date) due to irregular visits of patients. These remain challenges of the use of EHR data that should be addressed in future research.
Third, relevant information encompassed in the EHR can still be missed due to interindividual differences in reporting or reporting errors (abbreviations, misspelling, synonyms). Improved intelligent text-pattern recognition systems might reduce the risk of missing data.

Future perspectives
EHR data collection will probably be best used in conjunction with other data collection methods instead of replacing them. Investigators can take automated and manual EHR data collection into account in the design phase of a trial. Our results show that automated EHR screening for eligible patients might result in a somewhat different study population compared to the population currently enrolled. Effects on generalizability should be considered, although the resulting patient population might well reflect a more real-world sample of participants if their characteristics differ from the original study population [18]. Benefits of increased efficiency in the identification of eligible patients might make it easier to enroll patients, and as such, reach the desired number of inclusions faster than with conventional participant recruitment.

Limitations of this study
This study combined data from multiple medical centers, all using different EHR software vendors, and shows consistent results for the broad range of systems. Yet, three main limitations should be noted.
First, the accuracy of the information on hypertension, antiplatelet therapy, and beta-blockers deviated notably from collected trial data. For hypertension, the deviation between EHR and trial data was probably due to hypertension being defined as "using antihypertensive drugs" in the LoDoCo2 trial, which was hard to mirror in the EHR search query. Deviations on drug prescriptions and use variables were mainly attributed to registered timeframes of drugs and insufficient indexing of hospital drug prescription systems by the data extraction tool. Moreover, hospital physicians might not have adequately registered home prescriptions for all patients, which would also cause the results on drugs to deviate.
Second, it was assumed that all patients visiting the outpatient cardiology clinics of the three hospitals were screened conventionally for participation in the LoDoCo2 trial. If this was not the case, the expected yield of automated participant identification would be overestimated in this study.
Third, our Boolean query was not enhanced with natural language processing algorithms because of the limitations of the employed data mining tool and language-specific limitations. Text-mining was, therefore, interpreted broadly as the ability to automatically extract information from unstructured texts.

Conclusions
Data extracted from EHRs using text-mining can be used to identify patients eligible for trial participation and for the collection of baseline characteristics. This method might substantially reduce time and costs related to recruitment and data collection in clinical trials. Whether this promise can be realized depends on whether small accuracy losses are deemed acceptable in the context of the trial that is performed. This study focused on patient eligibility screening and participant baseline data collection; future research is needed to assess the quality of outcome data in EHRs.