Changing predictor measurement procedures affected the performance of prediction models in clinical examples

Objectives: The aim of this study was to quantify the impact of predictor measurement heterogeneity on prediction model performance. Predictor measurement heterogeneity refers to variation in the measurement of predictor(s) between the derivation of a prediction model and its validation or application. It arises, for instance, when predictors are measured using different measurement instruments or protocols.

Study Design and Setting: We examined the effects of various scenarios of predictor measurement heterogeneity in real-world clinical examples, using previously developed prediction models for the diagnosis of ovarian cancer, mutation carrier status for Lynch syndrome, and intrauterine pregnancy.

Results: Changing the measurement procedure of a predictor influenced the performance at validation of the prediction models in nine clinical examples. Notably, it induced model miscalibration. The calibration intercept at validation ranged from −0.70 to 1.43 (0 for good calibration), whereas the calibration slope ranged from 0.50 to 1.67 (1 for good calibration). The difference in C-statistic and scaled Brier score between derivation and validation ranged from −0.08 to +0.08 and from −0.40 to +0.16, respectively.

Conclusion: This study illustrates that predictor measurement heterogeneity can substantially influence the performance of a prediction model, underlining that predictor measurements used in research settings should resemble those used in clinical practice. Specification of measurement heterogeneity can help researchers explain discrepancies in predictive performance between the derivation and validation settings.

© 2019 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Clinical prediction models are commonly applied in clinical practice to assist health care professionals in determining a patient's diagnosis or prognosis [1]. Clinical prediction models are applied to patients who were not part of the data used to derive the model, often with the aim of estimating a probability for the presence of a disease or a future health state [2]. When applied to new patients, the performance in estimating these probabilities often differs from the performance in the derivation data. This is commonly explained by model overfitting with respect to the derivation data [3–6] and by differences in patient characteristics (case-mix) between derivation and validation settings [7–9]. Previous studies have identified imprecise predictor measurement procedures as another reason for suboptimal performance of prediction models at derivation [10,11] and highlighted that differences in predictor measurement procedures between derivation and validation settings substantially affected performance at validation [12–14]. Predictor variables may be measured by different procedures in external validation data than those applied in derivation data, that is, according to different measurement protocols or measurement instruments, or by applying different predictor definitions. We refer to these differences in measurement across settings as predictor measurement heterogeneity. Simulation studies have shown that predictor measurement heterogeneity can induce miscalibration of prediction models and affect discrimination and accuracy at external validation [12]. Although predictor measurement heterogeneity across derivation and validation samples appears to be common in clinical (research) settings (see, e.g., studies by Collins et al. [4], Te Velde et al. [15], and Smith et al. [16]), its impact on the performance of prediction models at validation is not well studied using empirical data.

What is new?

Key findings
Heterogeneity of predictor measurements across settings of derivation and validation had a substantial influence on predictive performance at validation, most notably on risk prediction model calibration.
Switching the measurement strategy of a predictor within the derivation set minimally affected measures of discrimination and overall accuracy.

What this adds to what was known?
Discrepancies in predictive performance between derivation and validation settings are commonly explained by the specific modeling strategies (which may result in overfitting) and by differences in case-mix distribution across settings. Our study identifies predictor measurement heterogeneity as another substantive explanation of unanticipated predictive performance at model validation or implementation.

What is the implication and what should change now?
Our findings underline the importance of transparent reporting of the predictor measurements that are used for derivation and validation of a prediction model.
Our findings provide initial guidance on implementation of clinical prediction models. In a clinical setting, predictors should be measured using procedures similar to those that were used for predictor measurement in the derivation and validation study to provide well-calibrated predictions of the outcome of interest.
In this study, we quantify the impact of predictor measurement heterogeneity on predictive performance in a series of real-world clinical examples.

Illustrating and defining predictor measurement heterogeneity
We briefly illustrate predictor measurement heterogeneity here using measurements of the predictor body mass index (BMI). We fitted a logistic regression model to predict the presence of a prestage of diabetes, containing only an intercept and two parameters for a linear and a quadratic term of BMI (this example was adapted from the study by Rosella et al. [11]). Data were available on 1,264 participants from the NHANES Study 2013–2014 [17]. BMI data were computed from participants' height and weight measurements, obtained by a trained examiner who followed a standardized protocol [18]. Because this measurement is close to what we would consider the ideal method of measurement, we refer to it as the preferred measurement. The second measurement of BMI was computed from weight and height self-reported by the participants, which we refer to as the pragmatic measurement. The concept predictor measurement heterogeneity refers to the phenomenon where the predictor measurement strategy at derivation differs from the measurement strategy at validation or application of the prediction model. A second regression model was fitted with a linear and quadratic term for BMI using the pragmatic measurement of BMI. Comparing the output of the two regression models, it becomes clear that substituting the preferred measurement of BMI with the pragmatic measurement changed the distribution of the linear predictor (Fig. 1). To better understand how substitution of pragmatic by preferred measurements (and vice versa) can affect predictive performance, we present empirical case studies in the next sections.
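The mechanics of this illustration can be sketched in code. The following is a self-contained toy simulation, not the NHANES data or the paper's R code: the self-report error model, coefficients, and sample size are our assumptions, chosen only to show how applying a model fitted on a preferred BMI measurement to a systematically lower self-reported BMI shifts the distribution of the linear predictor.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain gradient-descent logistic regression; w[0] is the intercept."""
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def features(bmi):
    z = (bmi - 28.0) / 5.0   # centre and scale for numerical stability
    return [z, z * z]        # linear and quadratic term, as in the example

random.seed(42)
n = 1000
bmi_pref = [random.gauss(28.0, 5.0) for _ in range(n)]
# hypothetical self-report: systematic underestimation plus extra noise
bmi_prag = [0.95 * b + random.gauss(0.0, 1.0) for b in bmi_pref]

# outcome simulated from the preferred (examiner-measured) BMI
y = [1 if random.random() < sigmoid(-1.0 + 0.8 * f[0] + 0.3 * f[1]) else 0
     for f in map(features, bmi_pref)]

w = fit_logistic([features(b) for b in bmi_pref], y)

def linear_predictor(bmi):
    f = features(bmi)
    return w[0] + w[1] * f[0] + w[2] * f[1]

lp_pref = [linear_predictor(b) for b in bmi_pref]
lp_prag = [linear_predictor(b) for b in bmi_prag]
mean_pref = sum(lp_pref) / n
mean_prag = sum(lp_prag) / n
print(f"mean linear predictor, preferred measurement: {mean_pref:+.3f}")
print(f"mean linear predictor, pragmatic measurement: {mean_prag:+.3f}")
```

Because the simulated self-reported BMI is systematically lower, the mean of the linear predictor drops when the pragmatic measurement is substituted into the model derived on the preferred measurement, mirroring the shift in distributions in Fig. 1.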

Methods
We examined the effects of predictor measurement heterogeneity in previously established prediction models, using three empirical datasets on the diagnosis of ovarian cancer, hereditary nonpolyposis colorectal cancer (CRC; Lynch syndrome), and intrauterine pregnancy, respectively. Scenarios from various clinical domains were investigated to provide a general assessment of the potential impact of predictor measurement heterogeneity.

Example dataset 1: diagnosis of ovarian cancer
The International Ovarian Tumor Analysis (IOTA) dataset includes clinical and ultrasound information on 5,914 nonpregnant women with at least one persistent adnexal mass [19]. We used data from IOTA phases I–III (1999–2012), in which we studied two prediction models, here referred to as Model 1 and Model 2. Model 1 is a logistic regression model that estimates the probability of presence of ovarian mass malignancy from preoperatively measured predictors: age (years), maximal diameter of the tumor (mm), personal history of ovarian cancer (yes/no), current use of hormonal therapy (yes/no), experience of pain during examination (yes/no), presence of ascites (yes/no), presence of blood flow within a solid papillary projection (yes/no), maximal diameter of the largest solid component (mm), presence of irregular cyst walls (yes/no), presence of acoustic shadows (yes/no), color score of intratumoral blood flow (ordinal, ranging 1–4), and presence of an entirely solid tumor (yes/no). Model 1 is based on the LR1 model, which was developed and internally validated in IOTA phase-I data [20] and has been externally validated several times [21–23]. Model 2 is a logistic regression model to preoperatively diagnose ovarian mass malignancy by age (years), the proportion of solid tissue, the presence of more than 10 locules (yes/no), the number of papillary structures, the presence of acoustic shadows (yes/no), and the presence of ascites (yes/no). It is a previously described reduction of Model 1, developed for methodological illustrations [24].

Example dataset 2: prediction of mutation carrier status (Lynch syndrome)
We analyzed data from 19,866 patients with CRC who were tested for mutations in Lynch syndrome–related mismatch repair genes. We studied a simplification of the PREMM1,2 model [25] and the MMRpredict model [5,26] in the Lynch syndrome dataset, which we refer to as Model 3. Model 3 is a logistic regression model that predicts the prevalence of MLH1/MSH2 mutations from the following predictors measured at baseline: sex, age at CRC diagnosis (years), and family history of CRC and endometrial cancer. Family history was defined as a weighted sum of affected first- and second-degree relatives, in which second-degree relatives were weighted half as much as first-degree relatives. The sum ranged from 0 to 3, with family history coded as 0, 1, or 2+ affected relatives.
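The weighted family history score can be written as a small helper. This is a sketch: the cap at 3 and the exact binning of fractional weighted sums into the 0/1/2+ categories are our assumptions, as the text does not spell them out.

```python
def family_history_score(first_degree, second_degree):
    """Weighted count of relatives affected by colorectal or endometrial cancer.

    Second-degree relatives count half as much as first-degree relatives.
    The weighted sum is capped at 3 and coded as 0, 1, or 2 (the "2+" category).
    """
    weighted = min(first_degree + 0.5 * second_degree, 3)
    if weighted < 1:
        return 0
    if weighted < 2:
        return 1
    return 2  # the "2+" category

print(family_history_score(0, 1))  # one second-degree relative: weighted 0.5 -> 0
print(family_history_score(1, 2))  # weighted 2.0 -> category 2 ("2+")
```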

Example dataset 3: prediction of intrauterine pregnancy
We analyzed data from 75 consecutive patients at the Early Pregnancy and Acute Gynecology Unit at Queen Charlotte's and Chelsea Hospital from November 2013 to May 2014. We studied a logistic regression model in the pregnancy data, here referred to as Model 4, that predicts the probability of an ongoing intrauterine pregnancy based on measurements of the human chorionic gonadotropin (hCG) level at presentation (pmol/L) and an hCG ratio of hCG at 48 hours after presentation to hCG at presentation. hCG levels could be measured using two different measurement instruments, named the "ria kit" and the "imm kit." Model 4 is adapted from an existing multinomial logistic regression model (named M4) [27] by grouping the outcome categories "ectopic pregnancy" and "pregnancy of unknown location."
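Model 4's predictor transformations, with the hCG level entering as a log-transformed value and the 48-hour change as a log-transformed ratio (as noted in the footnote to Table 2), can be sketched as follows; the function name is ours.

```python
import math

def model4_predictors(hcg_0h, hcg_48h):
    """Two predictor terms for Model 4: the log-transformed hCG level at
    presentation and the log-transformed ratio of the hCG level at 48 hours
    to the level at presentation (both in pmol/L)."""
    return math.log(hcg_0h), math.log(hcg_48h / hcg_0h)

terms = model4_predictors(1000.0, 2000.0)
print(terms)  # a doubling of hCG over 48 hours gives a log-ratio of log(2)
```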

Models and assessment of predictive performance
To separate the impact of predictor measurement heterogeneity from other possible external validation effects on predictive performance, such as changes in case-mix and outcome incidence, we focus on derivation and validation within the same study population and evaluate predictive performance [28]. In each example, we defined scenarios of measurement heterogeneity by identifying two measurement procedures of a single predictor: a preferred measurement and a pragmatic measurement (Table 1). The terms "preferred" and "pragmatic" are only meant in a relative sense: a preferred measurement may still be far from the ideal measurement of a particular phenomenon, but as a predictor of a particular outcome, it could be preferable over the pragmatic measurement in terms of a lower measurement error or an anticipated better predictive potential for that outcome.

[Fig. 1. Impact of predictor measurement heterogeneity on distributions of linear predictors. Density of the logit transformation of the predicted risks (linear predictor) from a logistic regression model predicting the probability of a prestage of diabetes using the predictor BMI. BMI was obtained as an instrumental (preferred) and self-reported (pragmatic) measure. Distributions of the linear predictors for both procedures are presented.]

[Note to Table 1: The intratumoral blood flow was scored by a color score ranging 1–4 in the original model. Alternatively, the extremes of this score (1 or 4) could be used as model input, because a color score is a subjective measurement: at model application, physicians could score the colors at the extremes (either no or high blood flow), or researchers could use a (public) dataset for model validation in which only a binary version of the score is available, rather than a categorical score, and recode this variable into scores 1 or 4.]
For each scenario, we assessed the optimism-corrected predictive performance of a regular maximum likelihood logistic regression model under both predictor measurement homogeneity and heterogeneity. The optimism correction was performed because measures of predictive performance based on the derivation data may give an overoptimistic assessment of model performance, as maximum likelihood models are generated to provide the best fit for the derivation data [28]. Measures of predictive performance were obtained by deriving and validating a prediction model in 500 bootstrap samples and averaging optimism-corrected measures of performance over the bootstrap samples (see Supplementary Material 1 for a detailed explanation) [28]. To assess predictor measurement homogeneity, the prediction model was derived and validated based on the same predictor definitions. To assess predictor measurement heterogeneity, a derivation and validation setting were recreated by deriving the model using the preferred measurement and validating the model using the pragmatic measurement (scenarios 1a–9a), or by deriving the model using the pragmatic measurement and validating the model using the preferred measurement (scenarios 1b–9b). Note that we isolate the impact of measurement heterogeneity here by keeping all other factors besides measurement heterogeneity constant (i.e., the modeling strategy, the included predictors, and patient characteristics are equal at derivation and validation).

[Note to Table 2: Measures of predictive performance were averaged over 500 bootstrap samples and corrected for optimism. Confidence intervals for the C-statistic and scaled Brier score were obtained by subtracting the optimism from the 95-percentile interval over the 500 bootstrap estimates of the performance measure under predictor measurement homogeneity. The scaled Brier score is computed as 1 − Brier/Brier_max. (a) The hCG measurements are included in the model as a log-transformed hCG measurement at presentation plus a log-transformed ratio of hCG at 48 hours to hCG at presentation.]
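The bootstrap optimism-correction loop can be sketched as follows. This is a hedged sketch after Harrell's well-known procedure; the paper's exact procedure is described in its Supplementary Material 1 and may differ in detail, and the toy "model" below (a sample mean scored by negative mean squared error) is ours, chosen only to make the loop runnable.

```python
import random

def bootstrap_optimism(data, fit, perform, n_boot=500, seed=7):
    """Bootstrap optimism correction (after Harrell).

    fit(sample)            -> fitted model
    perform(model, sample) -> performance measure (higher is better)

    In each of n_boot replicates the model is refitted in a resample and the
    optimism is the drop in performance when that model is applied back to
    the original sample. The apparent performance minus the average optimism
    gives the corrected estimate.
    """
    rng = random.Random(seed)
    apparent = perform(fit(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        model = fit(boot)
        optimism += perform(model, boot) - perform(model, data)
    return apparent, apparent - optimism / n_boot

# toy example: the "model" is the sample mean, scored by negative MSE
rng0 = random.Random(1)
data = [rng0.gauss(0.0, 1.0) for _ in range(50)]
fit = lambda sample: sum(sample) / len(sample)
perform = lambda model, sample: -sum((x - model) ** 2 for x in sample) / len(sample)

apparent, corrected = bootstrap_optimism(data, fit, perform)
print(f"apparent: {apparent:.4f}, optimism-corrected: {corrected:.4f}")
```

The corrected value is lower than the apparent one, reflecting the overoptimism of evaluating a model on the data it was fitted to.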
Measures of predictive performance were the calibration-in-the-large coefficient and calibration slope from a logistic recalibration model, the C-statistic (area under the receiver operating characteristic curve), and the Brier score. Model calibration refers to the agreement between observed outcomes and risk estimates [1,29]. The calibration-in-the-large coefficient evaluates whether there is a difference between the observed event fraction and the average predicted risk (0 for perfect calibration) and is estimated as the intercept of the recalibration model while the calibration slope is fixed at a value of 1. The calibration slope (<1 indicating overfitting, i.e., predicted risks that are too extreme, and >1 indicating underfitting) was computed by regressing the observed outcome on the logit transformation of the predicted risks; calibration was also evaluated graphically by plotting loess calibration curves. We considered the scaled Brier score, in which the Brier score is scaled by its maximum score under a noninformative model, Brier_scaled = 1 − Brier/Brier_max, so that it ranges from 0 for noninformative predictions to 1 for perfect predictions [1,29].
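These measures can be computed directly from predicted risks and observed outcomes. The paper used the rms package in R; the following is a minimal pure-Python sketch (function names and toy data are ours). The calibration slope would be fitted analogously to the intercept below, with an additional slope parameter in the Newton step; only the fixed-slope intercept is shown.

```python
import math

def c_statistic(p, y):
    """C-statistic (AUC) via all case/non-case pairs: a rank-order statistic."""
    cases = [pi for pi, yi in zip(p, y) if yi == 1]
    controls = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

def scaled_brier(p, y):
    """1 - Brier/Brier_max, where Brier_max comes from predicting the event
    fraction for everyone (a noninformative model)."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    ybar = sum(y) / n
    brier_max = sum((ybar - yi) ** 2 for yi in y) / n
    return 1.0 - brier / brier_max

def calibration_in_the_large(p, y, iters=25):
    """Intercept a of the recalibration model logit(P(y=1)) = a + logit(p),
    with the slope fixed at 1, fitted by Newton-Raphson."""
    lp = [math.log(pi / (1.0 - pi)) for pi in p]
    a = 0.0
    for _ in range(iters):
        q = [1.0 / (1.0 + math.exp(-(a + l))) for l in lp]
        grad = sum(qi - yi for qi, yi in zip(q, y))
        hess = sum(qi * (1.0 - qi) for qi in q)
        a -= grad / hess
    return a

p = [0.1, 0.2, 0.8, 0.9]
y = [0, 0, 1, 1]
print(c_statistic(p, y))                           # 1.0: every case outranks every non-case
print(round(scaled_brier(p, y), 3))                # 0.9
print(abs(calibration_in_the_large(p, y)) < 1e-6)  # True: intercept ~0, toy predictions are calibrated
```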
To quantify the resemblance between the predictor measurement procedures, the partial correlation between the preferred and pragmatic predictors was estimated by correlating the residuals of two linear regression models, regressing each of the predictor measurements on the outcome and the other covariates in the model. Shrunken regression coefficients from a Ridge logistic regression model were estimated, for which the tuning parameter (necessary for shrinkage) was determined by the value minimizing the deviance in 10-fold cross-validation [30]. All analyses were performed in R 3.5.1 [31], and R code is available at https://doi.org/10.5281/zenodo.3571193. Measures of predictive performance were obtained using the rms package [32].
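The residual-correlation construction of the partial correlation can be sketched as follows. This simplified pure-Python version (names ours) adjusts for a single covariate, the outcome; the paper additionally adjusts for the other covariates in the model.

```python
def residuals(x, z):
    """Residuals of a simple linear regression of x on a single covariate z."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    beta = (sum((zi - mz) * (xi - mx) for xi, zi in zip(x, z))
            / sum((zi - mz) ** 2 for zi in z))
    return [xi - (mx + beta * (zi - mz)) for xi, zi in zip(x, z)]

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def partial_correlation(x_pref, x_prag, outcome):
    """Correlation between the two measurements after removing the part of
    each that is linearly explained by the outcome."""
    return pearson(residuals(x_pref, outcome), residuals(x_prag, outcome))

# toy data: two measurements of the same quantity plus a binary outcome
x_pref = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_prag = [1.1, 2.2, 2.9, 4.2, 4.8, 6.1]
outcome = [0, 0, 0, 1, 1, 1]
r = partial_correlation(x_pref, x_prag, outcome)
print(f"partial correlation: {r:.3f}")  # high: the measurements agree closely given the outcome
```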

Results
Measures of predictive performance in all scenarios are presented in Table 2 (under measurement homogeneity) and in Figs. 2–4 (under measurement heterogeneity).

Predictive performance under predictor measurement heterogeneity
In settings where a different measure of aggregation for defining the predictor was used (scenarios 1ab–2ab), the direction of miscalibration was related to the shift in aggregational measure (Fig. 2). When the maximum tumor diameter was used at derivation and the mean at validation, the calibration-in-the-large coefficient was larger than zero, indicating a systematic underestimation of the predicted risks at validation (scenarios 1a and 2a). The reverse occurred in scenarios 1b and 2b. Calibration-in-the-large was more strongly affected in scenario 2ab, where the predictor–outcome association was stronger than in scenario 1ab.
Truncation of a continuous predictor measurement showed the following effects on calibration (scenario 3ab; Fig. 2). When the truncated value was used for model derivation and the nontruncated value at validation, the calibration-in-the-large coefficient indicated systematic overestimation of the predicted risks at validation, and the calibration slope was smaller than one, indicating overfitting with respect to the derivation data; predicted risks were too extreme compared with the observed proportions (and vice versa in scenario 3b).
When the categories of an ordinal predictor were collapsed into a binary variable by using only the extremes of the scale (scenario 4a; Fig. 2), the calibration-in-the-large coefficient indicated systematic overestimation of the predicted risks, the calibration slope indicated overfitting with respect to the derivation data, and the C-statistic decreased (and vice versa in scenario 4b).
When a more stringent dichotomization was used at validation by shifting the cut-off of a count upward (scenario 5b; Fig. 2), or by including only first-degree relatives in a summary score on family history rather than both first- and second-degree relatives (scenarios 6a and 7a; Fig. 3), risks were systematically underestimated, as indicated by the calibration-in-the-large coefficient (and vice versa in scenarios 5a, 6b, and 7b). In scenario 6a, the calibration slope indicated model underfitting, the C-statistic decreased, and the scaled Brier score decreased (and vice versa in scenario 6b).

[Fig. 2. Measures of predictive performance under predictor measurement heterogeneity of a model predicting the probability of having ovarian mass malignancy. The model is applied to the International Ovarian Tumor Analysis (IOTA) dataset, containing information on 5,914 nonpregnant women (1999–2012). Error bars represent the 95-percentile interval over 500 bootstrap samples. Error bars with an asterisk (*) indicate scenarios 1a–5a, meaning the model was derived using the preferred measurement and validated using the pragmatic measurement; error bars with a point (•) indicate scenarios 1b–5b, meaning the model was derived using the pragmatic measurement and validated using the preferred measurement.]
Switching from serum to urine hCG measurements (scenario 8ab) showed the following effects on predictive performance (Fig. 4). When the predictor measurement had a smaller variance at derivation compared with validation (scenario 8a), the calibration-in-the-large coefficient indicated systematic overestimation of the predicted risks, and the calibration slope indicated model overfitting. The C-statistic and scaled Brier score decreased. The reverse occurred when the predictor measurement had a lower variance at validation compared with derivation (scenario 8b), except for the scaled Brier score, which again decreased.

[Fig. 4 (caption fragment): Error bars with an asterisk (*) indicate scenarios 8a and 9a, meaning the model was derived using the preferred measurement and validated using the pragmatic measurement; error bars with a point (•) indicate scenarios 8b and 9b, meaning the model was derived using the pragmatic measurement and validated using the preferred measurement.]
A switch in measurement instrument, that is, using the ria kit vs. the imm kit for hCG measurement in serum (scenario 9ab; Fig. 4), minimally affected predictive performance. The large uncertainty around measures of predictive performance in scenarios 8ab and 9ab can largely be explained by the limited sample size.

Discussion
In this study, we evaluated the impact of predictor measurement heterogeneity in nine different scenarios in three clinical datasets. A change in measurement strategy of a predictor within the derivation set, from preferred measurement to pragmatic measurement or vice versa, minimally affected measures of predictive performance in our example studies. We found that heterogeneity of measurements across settings of derivation and validation can have a substantial impact on the performance of a prediction model, most notably on overall accuracy and calibration of risk predictions, resulting in systematic over- or underestimation of predicted risks and in risk models that are consistent with overfitting (systematically too extreme predictions) or underfitting (a systematically too narrow range of predictions).
In the examples, the impact on calibration was larger when predictors were strongly associated with the outcome or when the partial correlation between predictor measurement strategies was lower. Using Ridge regression as a shrinkage method or correcting for optimism did not compensate for the effects of measurement heterogeneity in our study. The variety of effects on predictive performance in the examples illustrated the difficulty of anticipating the exact impact of predictor measurement heterogeneity, emphasizing the need to be generally mindful of (dis)similarities of predictor measurement strategies between derivation and validation studies.
We observed small effects of predictor measurement heterogeneity on the discriminatory power of the model at validation in our examples. Previous simulation studies found larger effects on the C-statistic [10–12]. Our finding may be explained by the fact that we focused on within-sample predictive performance under measurement heterogeneity in a single predictor. With a larger number of predictors subject to measurement heterogeneity, we anticipate that the combined effect on discrimination can be larger. In addition, given that the C-statistic is a rank-order statistic, it is possible that this metric is less affected by measurement heterogeneity [33].
Our findings showed that internal predictive performance may not be affected by changes in predictor measurement strategy within the same dataset, in line with previous studies [10,11,34]. Previous research showed that variations in measurement error did not affect risk calibration [10], but these findings were restricted to within-sample effects on predictive performance only. Within the derivation dataset, models derived using logistic regression achieve, by definition, a calibration-in-the-large coefficient of zero and a calibration slope of one, regardless of the measurement error structure of the predictors [29]. Our study highlights that this does not apply when the degree or structure of measurement error varies across settings of derivation and validation, that is, under measurement heterogeneity.
It is common practice in validation studies to quantify the relatedness of derivation and validation samples by inspecting the distribution of the linear predictors, also referred to as comparison of case-mix distributions [8,9,35]. Dissimilarities in the distributions of the linear predictor between derivation and validation may arise from both actual differences in patient characteristics and differences in the procedures used to measure patient characteristics. By identifying predictor measurement heterogeneity as a separate explanation of discrepancies in linear-predictor distributions across settings, our findings can facilitate the implementation of the influential TRIPOD statement in clinical prediction research [36].
Our study has several limitations. First, it was limited to three empirical datasets with a diagnostic outcome modeled using logistic regression. One dataset, from the IOTA study, was a multicenter study in which homogeneous measurement strategies across centers were among its hallmark characteristics [19]. Measurement heterogeneity within development and validation studies, for example, because of variability in measurement precision between clinicians or centers [37], is an important topic for future research. Given the potential impact and the limited attention to date [38], research is needed on the effect of measurement heterogeneity for other statistical models and outcomes (e.g., survival models for time-to-event outcomes) and on its impact on more flexible prediction modeling strategies. Finally, the similarity between the preferred and pragmatic measurement of a predictor was quantified using a partial correlation coefficient. This measure quantifies the conditional association between predictor measurements rather than agreement [39]. As the present article aimed to examine whether variation in predictor measurement strategies across settings can have an effect of any degree or direction on predictive performance, we presented a single measure of similarity of predictor measurements and left out further quantification. One way to visualize agreement between measurements would be Bland–Altman plots [40].
The following recommendations follow from our work. When a prediction model is derived, predictor measurements should be clearly defined and should ideally resemble the procedures in the intended setting of application as closely as possible. For prediction model validation studies, we encourage researchers to investigate to what extent predictor measurement procedures are homogeneous across settings and whether heterogeneity may have contributed to differences in predictive performance between the validation and derivation settings. Accurate reporting of predictor measurement procedures in both derivation and validation studies is, therefore, essential. Furthermore, we take the position that addressing measurement heterogeneity at the data collection stage is preferable to statistical correction for measurement error in predictors. Corrections, which typically aim to alleviate measurement-error bias in regression coefficients, may increase rather than reduce measurement heterogeneity [12].
We emphasize that consideration of predictor measurement heterogeneity is also crucial in the implementation stage of a prediction model in clinical practice. Deployment of a prediction model might itself alter predictor measurement procedures and thereby introduce measurement heterogeneity. For example, after the implementation of a prediction model, physicians may be recommended to use a more precise or standardized measurement (or even to routinely measure predictors that were not measured in all patients up to that point). For the implementation of prediction models in clinical practice, our findings indicate that measurement procedures should follow the measurements in the derivation and validation datasets as closely as possible.
In summary, our findings highlight that predictor measurement heterogeneity can have a substantial influence on the performance of a prediction model, most notably on risk calibration. Explicit reporting of the procedures and timing involved in the measurement of predictors in derivation and validation studies is vital to improve the performance and applicability of prediction models in clinical practice.