
Volume 56, Issue 1, Pages 28-37 (January 2003)



Developing a prognostic model in the presence of missing data: an ovarian cancer case study

Taane G. Clark, Douglas G. Altman

Received 8 October 2001; received in revised form 15 February 2002; accepted 19 July 2002.

Abstract 

When developing prognostic models in medicine, covariate data are often missing and the standard response is to exclude those individuals whose data are incomplete from the analyses. This practice leads to a reduction in the statistical power, and may lead to biased results. We wished to develop a prognostic model for overall survival from 1,189 primary cases (842 deaths) of epithelial ovarian cancer. A complete case analysis restricted the sample size to 518 (380 deaths). After applying a multiple imputation (MI) framework we included three real values for each one imputed, and constructed a model composed of more statistically significant prognostic factors and with increased predictive ability. Missing values can be imputed in cases where the reason for the data being missing is known, particularly where it can be explained by available data. This will increase the power of an analysis and may produce models that are more statistically reliable and applicable within clinical practice.

Article Outline

Abstract

1. Introduction

2. Patients and methods

2.1. Study population

2.2. Outcomes of interest

2.3. Prognostic factors

2.4. Auxiliary variables

2.5. Missing data

2.6. Multiple imputation

2.7. Investigating the missing data

2.8. Imputation model

2.9. Creating 10 complete datasets

2.10. Statistical models

2.11. Measures of model performance

2.12. Statistical software

3. Results

3.1. Long-term outcome

3.2. Missing data

3.3. Evidence of MAR data

3.4. Imputed data

3.5. Univariate survival analysis

3.6. Cox models

3.7. Measures of model performance

4. Discussion

Acknowledgment

References

Copyright

1. Introduction 


The identification of factors related to prognosis and their use in predicting the prognosis of individual patients have importance in clinical research and clinical practice. Prognostic factors are used in designing clinical trials, controlling for confounding factors in observational studies of treatment efficacy, counselling patients, formulating strategies for treatment, and for the optimal use of expensive medical tests [1]. Construction of prognostic models or prognostic indices based on multiple factors (e.g., Nottingham Prognostic Index [2] in breast cancer) is often hampered by the small size of available datasets, especially in rare diseases.

Large databases of patients are required to establish reliably the effects of different prognostic factors on long-term outcome. Frequently, however, data on prognostic factors are missing for some patients, and the standard response to this problem is to exclude these individuals from the analysis. This practice not only wastes valuable data, but can also lead to invalid results if the excluded group is a selective (i.e., non-random) sub-sample from the entire sample with respect to prognosis. In addition, the statistical power of such analyses will be reduced and the number of events per potential prognostic variable may decrease resulting in less stable results [3].

Ovarian carcinoma is the commonest cause of death from gynecological malignancy in most of the Western world. About 6,000 women in the United Kingdom develop ovarian cancer each year and about two-thirds of the women will die from the disease. The age-standardized death rate of 15/100,000 has doubled in the past 70 years, and 5-year survival is just 30% [4].

This article discusses an application of a missing data methodology called multiple imputation (MI) to an ovarian cancer dataset, with the purpose of fitting a prognostic model for overall survival. Our focus was solely on missing data for potential prognostic factors, as survival data were available for all patients. We describe the pattern of the missing data, and assess the effects of missing data on the structure and performance of the prognostic model. In addition, we discuss whether the application of MI was necessary.

2. Patients and methods 


2.1. Study population 

We used data from the 1,189 patients diagnosed at the Western General Hospital (Edinburgh) between 01/01/1984 and 31/12/1999 with primary cases of epithelial ovarian cancer [5]. The date of diagnosis refers to the time at which tissue samples were taken. Follow-up data were available up until the end of February 2001. Patients were aged between 15 and 90 years at the time of diagnosis (median 61 years), presented at initial diagnosis with predominantly FIGO stages III and IV (64.6%), and predominantly with a serous papillary histology (51.8%). One thousand one hundred forty (95.9%) patients underwent surgery (their date of diagnosis), 894 (72.3%) patients received chemotherapy, and 860 (75.4%) had both surgery and chemotherapy. Of those patients who received chemotherapy, 654 (73.2%) patients were treated with single agent platinum regimens.

2.2. Outcomes of interest 

Overall survival, or equivalently, time to death, was the outcome of interest.

2.3. Prognostic factors 

The prognostic factors investigated were age at diagnosis, FIGO stage, grade of tumor (I = well differentiated, II = moderately differentiated, III = poorly differentiated), histology, the presence or absence of ascites, the diameter of the largest residual tumor mass after primary cytoreductive surgery (<2 cm, 2–5 cm, and >5 cm), performance status using the ZUBROD-ECOG-WHO scale (0 = Normal activity, 1 = Symptoms, but nearly ambulatory, 2 = Some bed time, but needs to be in bed less than 50% of the normal daytime, 3 = Needs to be in bed more than 50% of the normal daytime, and 4 = Unable to get out of bed), and the first CA125, alkaline phosphatase, and albumin laboratory test results that fell between diagnosis and 7 days after the first chemotherapy.

2.4. Auxiliary variables 

There were a number of additional variables that are not considered as factors in prognostic studies, but were potentially associated with missing data. These included: surgery (yes, no), clinical trial participation (yes, no), and chemotherapy regimen (none, platinum-based, nonplatinum based). These variables were free of missing data.

2.5. Missing data 

We were only interested in missing values of variables that are potentially prognostic. Of the 10 prognostic variables considered, 8 had incomplete data, with the proportion of missing data varying between 2 and 43%. We did not assume that our missing data were a random sample of the whole dataset, that is, missing completely at random (MCAR) [6]. Instead, we assumed that the missing data were missing at random (MAR) [6]. This missing data mechanism assumes that the probability that a data value is missing depends only on values of variables that were actually measured. In other words, we assumed that the missingness of a variable cannot depend on the values of variables that we did not collect data on. Conceptually, it also excludes a dependence of the occurrence of missing values on the true, but unobserved, value of the variable (a mechanism called missing not at random, MNAR [6]).

Most reasonably sophisticated statistical techniques assume the data are MAR. These methods can be broadly classified as either imputation approaches or likelihood-based approaches [7] (where the Expectation-Maximization algorithm is usually applied). The major distinction is that the imputation methods fill in missing data, while likelihood-based approaches do not require the missing data to be estimated explicitly. The latter approach requires correct model specification; software for it is less readily available for survival data, and it is more difficult to model a mixture of binary, polytomous, and continuous prognostic factors in this setting. Hence, we have applied an imputation procedure known as multiple imputation [6,8].

2.6. Multiple imputation 

The simplest form of imputation or “filling in the missing data” is “single imputation,” where missing values are estimated using single “best guesses” (e.g., the mean of other values). However, an analysis using the resulting “complete” dataset, as if the imputed values were real measurements, ignores the uncertainty in the imputation process. In contrast, multiple imputation (MI) is the analysis of multiple “complete” datasets, and incorporates an adjustment of standard errors and other statistics for the imputation uncertainty.

We applied an MI framework to account for missing prognostic factor data, and used Bayesian simulation to generate the missing data. Our approach is similar to that described by Van Buuren et al. [9]. Suppose that the statistic of interest in the analysis (here the log of the hazard ratio, HR) is represented by a quantity θ. (In reality θ is a vector of log HRs for different variables, but without loss of generality we have simplified the notation.) The steps in the MI procedure we applied were:

1. Investigating the missing data:

   a. Quantifying the multivariate patterns of the missing data.

   b. Plotting the proportion of missing data for each potential prognostic factor against diagnosis year to show time trends in measurement practice.

   c. Exploring the relationship between missing data of potential prognostic factors with other prognostic variables, survival information [i.e., (log) survival time and the censoring indicator], and auxiliary variables.

2. Specifying an imputation model.

3. Using the model to generate (via a random sampling procedure) M sets of imputed values for the missing data points, thus creating M completed datasets.

4. For each completed dataset, carrying out a Cox regression (see below), obtaining the estimate of interest $\hat{\theta}_i$ and its estimated variance $\mathrm{var}(\hat{\theta}_i)$ for i = 1, …, M.

5. Combining the results from the different datasets to obtain a prognostic model. The MI estimate of θ is

$$\hat{\theta} = \frac{1}{M} \sum_{i=1}^{M} \hat{\theta}_i$$

(i.e., the mean across the models fitted to the imputed datasets) and the MI estimate of variance is

$$\mathrm{var}(\hat{\theta}) = \frac{1}{M} \sum_{i=1}^{M} \mathrm{var}(\hat{\theta}_i) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{i=1}^{M} \left(\hat{\theta}_i - \hat{\theta}\right)^2$$

6. Constructing a final “completed data” model (later referred to as Model 2) by removing the covariate with the highest P-value and repeating steps 4 and 5 until all remaining covariates were significant at a 5% level (backward elimination).

The first term after the equals sign relates to the variance within the imputed datasets, whereas the second term captures the uncertainty due to the variability in the estimates θi from the imputed datasets. The term 1+1/M is a bias correction factor.
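The pooling in step 5 can be illustrated with a minimal Python sketch. The function name `pool_rubin` and the three log hazard ratios and variances are hypothetical, chosen only to show the arithmetic; they are not estimates from this study, and the analyses reported here were run in S-Plus, not Python.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their variances (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    theta_bar = mean(estimates)      # MI estimate: mean of the M estimates
    within = mean(variances)         # first term: average within-imputation variance
    between = variance(estimates)    # sample variance (divisor M - 1) of the estimates
    total = within + (1 + 1 / m) * between
    return theta_bar, total


# Illustrative (made-up) log hazard ratios and variances from M = 3 datasets
log_hrs = [0.30, 0.34, 0.32]
log_hr_vars = [0.010, 0.012, 0.011]
theta, var_theta = pool_rubin(log_hrs, log_hr_vars)
```

In this toy input the between-imputation term is small relative to the within-imputation term, so the pooled variance is only slightly larger than the average of the three variances.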

Simulation studies have shown that the required number of repeated imputations (M) can be as low as three for data with 20% of missing entries [9]. We had three predictors with approximately twice this percentage, and decided that 10 repeated imputations (i.e., M = 10) would be a conservative choice. Unless rates of missing information are unusually high, there tends to be little or no practical benefit to using more than 10 imputations [6]. In the following sections we describe some aspects of steps 1 to 3, where steps 2 and 3 are the Bayesian simulation. These sections are quite technical, especially the sections describing the imputation model and the sampling process.
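As a rough guide to why a small M can suffice, Rubin's large-sample result for the efficiency of an estimate based on M imputations, relative to one based on infinitely many, is shown below. Here γ denotes the fraction of missing information, which is not reported in this article; the values 0.2 and 0.4 are assumptions chosen to mirror the missing-data percentages quoted above, and the fraction of missing information need not equal the proportion of missing entries.

```latex
\[
\mathrm{RE}(M, \gamma) = \left(1 + \frac{\gamma}{M}\right)^{-1}
\]
\[
\mathrm{RE}(3, 0.2) \approx 0.94, \qquad
\mathrm{RE}(3, 0.4) \approx 0.88, \qquad
\mathrm{RE}(10, 0.4) \approx 0.96
\]
```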

2.7. Investigating the missing data 

As we assumed that the missing data were MAR, we looked for evidence of MAR data by assessing associations between missing data and observed variables within our dataset. The relationships between missing potential prognostic factors and survival time were explored using Kaplan-Meier curves stratified by a missing value indicator for each of the potential prognostic factors of interest. The log-rank method was used to test for differences between the distributions of survival times for patients with missing and nonmissing data for a particular potential prognostic factor. For each prognostic variable the relationship between missingness (yes vs. no) and other variables was assessed using univariate logistic regressions. A more correct approach would be to produce adjusted survival curves and multivariate logistic regressions, but as the adjustment requires prognostic factors, some of which have missing data, this was not possible.
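A minimal sketch of the missingness check for one prognostic factor and one binary variable follows. It relies on the fact that, for a single binary predictor, the slope of a univariate logistic regression of the missingness indicator equals the log odds ratio of the corresponding 2 x 2 table. The function names and the ten data values are hypothetical and are not taken from the study dataset.

```python
import math


def missing_indicator(values):
    """Return 1 where a value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]


def missingness_log_odds_ratio(values, binary_predictor):
    """Log odds ratio of missingness for predictor == 1 versus predictor == 0."""
    miss = missing_indicator(values)
    # 2 x 2 table of counts, indexed as n[predictor][missing]
    n = [[0, 0], [0, 0]]
    for m, x in zip(miss, binary_predictor):
        n[x][m] += 1
    if 0 in (n[0][0], n[0][1], n[1][0], n[1][1]):
        raise ValueError("empty cell: odds ratio undefined")
    log_or = math.log((n[1][1] * n[0][0]) / (n[1][0] * n[0][1]))
    se = math.sqrt(sum(1 / c for row in n for c in row))  # Woolf standard error
    return log_or, se


# Made-up example: CA125 values (None = missing) and trial participation (1 = yes)
ca125 = [35.0, None, 210.0, None, None, 88.0, 40.0, None, 500.0, 12.0]
in_trial = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
log_or, se = missingness_log_odds_ratio(ca125, in_trial)
```

A negative log odds ratio in this toy input means that missingness is less common among trial participants; with real data the estimate would be compared with its standard error before reading anything into it.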

2.8. Imputation model 

Step 2 is an explicit attempt to model the MAR process. Imputation models were specified for each potential prognostic factor with missing data irrespective of the quantity of missing data. The purpose of the models was to provide a set of plausible values for the missing potential prognostic factor data, and involved two modeling choices: the form of the model (linear, logistic, etc.) and the set of predictors that enter the model. The form is considered to be less important, and is largely immaterial unless the uncertainty associated with the missing entries is small [9].

For binary variables (e.g., the presence or absence of ascites) we used a logistic model, for categorical variables with three or more ordered levels (e.g., performance status) we applied a polytomous (>2 levels) logistic model, and for continuous variables (e.g., log CA125) we used normal linear regression truncated where appropriate to the credible range of values.

Composition of the sets of predictors in these models was based on guidelines in Van Buuren et al. (1999) [9]. In particular, all variables that appear in a model constructed using complete cases should be included in the imputation models. Failure to do so may bias the complete data analysis. It has also been observed that including as many predictors in the imputation model as possible tends to make the MAR assumptions more plausible. However, we recognized that there may be no substantial increase in explained variance after selecting a certain number of predictors, and identified computational problems associated with multicollinearity and inclusion of many predictors with missing data. We proposed to confine the imputation model for each prognostic factor to include all other prognostic variables from Table 1, survival variables, and auxiliary variables.

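Table 1.

Potential prognostic factors

n = 1189

| Prognostic variable | N | % | Median (range) | P-value |
|---|---|---|---|---|
| Age at diagnosis (years) | 1189 | 100.0 | 61 (15–90) | <.001 |
| FIGO stage | | | | <.001 |
| I | 281 | 23.6 | | |
| II | 119 | 10.0 | | |
| III | 590 | 49.6 | | |
| IV | 178 | 15.0 | | |
| Unknown | 21 | 1.8 | | |
| Grade | | | | <.001 |
| I | 131 | 11.0 | | |
| II | 278 | 23.4 | | |
| III | 641 | 53.9 | | |
| Unknown | 139 | 11.7 | | |
| Histology | | | | <.001 |
| Serous papillary | 616 | 51.8 | | |
| Endometrioid | 240 | 20.2 | | |
| Mucinous | 131 | 11.0 | | |
| Mesonephroid (clear cell) | 101 | 8.5 | | |
| Mixed mesodermal | 42 | 3.5 | | |
| Adenocarcinoma | 38 | 3.2 | | |
| Undifferentiated | 21 | 1.8 | | |
| Ascites | | | | <.001 |
| Presence | 707 | 59.5 | | |
| Absence | 417 | 35.1 | | |
| Unknown | 65 | 5.5 | | |
| Performance status (Zubrod) | | | | <.001 |
| 0 | 328 | 27.6 | | |
| 1 | 228 | 19.2 | | |
| 2 | 84 | 7.1 | | |
| 3+4 | 38 | 3.2 | | |
| Unknown | 511 | 43.0 | | |
| Residual disease | | | | <.001 |
| <2 cm | 641 | 53.9 | | |
| 2–5 cm | 165 | 13.9 | | |
| >5 cm | 302 | 25.4 | | |
| Unknown | 81 | 6.8 | | |
| CA125 | 749 | 63.0 | 208 (6–22878) | |
| Log(CA125) | 749 | 63.0 | 5.3 (1.8–10.0) | <.001 |
| Alkaline phosphatase (ALP) | 793 | 66.7 | 94 (26–1810) | |
| Log(ALP) | 793 | 66.7 | 4.5 (3.3–7.5) | <.001 |
| Albumin | 797 | 67.0 | 39 (20–50) | <.001 |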

For example, the imputation model for ascites would have a logistic form with age, FIGO stage, grade, histology, residual disease, performance status, log CA125, log alkaline phosphatase, albumin, log survival time, death, surgery, clinical trial participation, and chemotherapy regimen as predictors.
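The rule for composing each imputation model can be written down as a small Python sketch. The variable names are hypothetical labels for the factors in Table 1. Assigning a polytomous logistic form to FIGO stage, grade, and residual disease is an assumption based on their being categorical with three or more levels; the text names only performance status as an example of that form.

```python
# Hypothetical labels for the variables described in Sections 2.3 and 2.4
PROGNOSTIC = ["age", "figo_stage", "grade", "histology", "ascites",
              "residual_disease", "performance_status", "log_ca125",
              "log_alk_phos", "albumin"]
SURVIVAL = ["log_survival_time", "death"]
AUXILIARY = ["surgery", "clinical_trial", "chemo_regimen"]

# Model form for each variable with missing data (age and histology are complete)
MODEL_FORM = {
    "ascites": "logistic",
    "performance_status": "polytomous logistic",
    "figo_stage": "polytomous logistic",        # assumed, see lead-in
    "grade": "polytomous logistic",             # assumed, see lead-in
    "residual_disease": "polytomous logistic",  # assumed, see lead-in
    "log_ca125": "linear",
    "log_alk_phos": "linear",
    "albumin": "linear",
}


def imputation_predictors(target):
    """All other prognostic variables, then survival and auxiliary variables."""
    return [v for v in PROGNOSTIC if v != target] + SURVIVAL + AUXILIARY


ascites_predictors = imputation_predictors("ascites")
```

For ascites this yields the 14 predictors listed in the example above, paired with a logistic form.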

2.9. Creating 10 complete datasets 

To generate replacement data for the missing values in each of the 10 imputations, we did not assume a particular form for the multivariate distribution of our data. This contrasts with the approach of Schafer (1997), which assumes the distribution to be multivariate Normal for continuous data, a log-linear model for categorical data, or a general location model for a mixture of continuous and categorical data. We assumed only that a multivariate distribution existed, and that Gibbs sampling [10] of the conditional distributions (based on the models above) could generate samples from it. Specifically, this was an iterative process in which the missing data for a prognostic variable were estimated using its imputation model, and then these completed data were used in the estimation of the missing values for another variable. Each iteration ended when all variables had been updated. The application of Gibbs sampling ensured that the imputation process was not deterministic and that there was random variation between the completed datasets; it is also the main reason the approach is considered (misleadingly) to be a form of Bayesian simulation.

The Gibbs sampling algorithm was run for 80 updating iterations for each of the 10 imputations. The time taken for each variable was a function of the prevalence of missing values. It has been found that in the presence of large amounts of missing data, convergence can be obtained in as few as 10 iterations [9]. An assessment of convergence was made by plotting the standard deviations and means of the imputations by iteration. These plots allowed us to assess whether between-imputation variability had stabilized and whether estimates were free of trend.

The frequency distributions of prognostic variables in the 10 datasets with imputed missing values were compared with the frequency distribution in the original dataset using histograms and basic summary statistics (e.g., medians, ranges, and proportions). This comparison assessed the extent to which the imputed datasets were consistent among themselves and with the original data.
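The iterative scheme described in this section can be sketched in Python for a toy case of two continuous variables. This is a simplified illustration, not the MICE implementation used in the study: it handles only linear models for two variables, and it adds residual noise to the fitted values without also drawing the regression parameters from their posterior. All names and data values are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def impute_once(data, n_iter, rng):
    """Return one completed copy of `data` (two columns, None = missing)."""
    names = list(data)
    missing = {k: [v is None for v in col] for k, col in data.items()}
    filled = {}
    for k, col in data.items():
        obs_mean = mean(v for v in col if v is not None)  # starting values
        filled[k] = [obs_mean if v is None else v for v in col]
    for _ in range(n_iter):
        for target in names:  # an iteration ends when every variable is updated
            other = next(k for k in names if k != target)
            obs = [i for i, m in enumerate(missing[target]) if not m]
            a, b, sd = fit_line([filled[other][i] for i in obs],
                                [filled[target][i] for i in obs])
            for i, m in enumerate(missing[target]):
                if m:  # random draw around the prediction, so not deterministic
                    filled[target][i] = a + b * filled[other][i] + rng.gauss(0, sd)
    return filled


def multiply_impute(data, m=10, n_iter=20, seed=0):
    """Create m completed datasets from one incomplete dataset."""
    rng = random.Random(seed)
    return [impute_once(data, n_iter, rng) for _ in range(m)]


# Made-up values standing in for albumin and log alkaline phosphatase
example = {
    "albumin": [39.0, 41.0, None, 35.0, 30.0, None, 44.0, 38.0],
    "log_alp": [4.4, 4.3, 4.9, None, 5.2, 4.6, 4.1, None],
}
completed = multiply_impute(example, m=3, n_iter=5, seed=1)
```

Observed values are carried through unchanged in every completed dataset; only the cells that were missing differ between datasets, which is the between-imputation variation that the pooled variance accounts for.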

2.10. Statistical models 

Univariate analyses on continuous prognostic factors were performed using Cox regression [10], and categorical (including categorized continuous) prognostic factors were analyzed using the Kaplan-Meier and the log-rank methods. The linearity of continuous variables was assessed using fractional polynomials [11]. Univariate analyses were performed on complete cases in the original dataset.

The fundamental method of multivariate analysis was Cox regression [12]. The proportional hazard assumption for each predictor was tested using an approximate score statistic of linear correlation between the rank order of failure times in the sample and Schoenfeld partial residuals [13].

Like the final “completed data” model, the final “complete case” model (later referred to as Model 1) was obtained using backward elimination, but was based on the original data and not the completed data.

2.11. Measures of model performance 

We evaluated the predictive performance of models by considering measures of discrimination and calibration. Discrimination refers to the ability to distinguish between high risk and low risk patients, and was quantified using the c-index [13] (a generalization of the area under the Receiver Operating Characteristic (ROC) curve [13]) and the Nagelkerke $R^2$ ($R^2_N$) [13]. The c-index is the probability of concordance between predicted and observed survival based on pairs of individuals, with c = 0.5 for random predictions and c = 1 for a perfectly discriminating model. Similarly, $R^2_N$ = 0 indicates no predictive ability and $R^2_N$ = 1 indicates perfect predictions. The term predictive ability is synonymous with discrimination and describes the ability of a set of prognostic factors to explain variation in outcome. In theory, high predictive ability implies accurate prediction for individual patients.
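The pairwise definition of the c-index can be made concrete with a short Python sketch. It is a plain implementation of the usual concordance calculation for right-censored data, not the Hmisc or Design routine used in the study. Pairs with tied survival times are simply skipped here, which is a simplification. The function name and the four-patient example are hypothetical.

```python
def c_index(times, events, risk_scores):
    """Concordance index for right-censored survival data.

    A pair is usable when the patient with the shorter time died (event = 1).
    It is concordant when that patient also has the higher risk score;
    tied risk scores count as half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Made-up example: survival times, death indicators, and model risk scores
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
scores = [0.9, 0.6, 0.7, 0.1]
c = c_index(times, events, scores)  # 4 of the 5 usable pairs are concordant
```

In this example the only discordant pair is the patient who died at time 10 with score 0.6 and the patient censored at time 15 with score 0.7.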

Calibration refers to whether the predicted probabilities agree with observed probabilities, or equivalently, indicates the ability of the model to make unbiased estimates of outcome. Calibration was quantified using an estimate of slope shrinkage and was based on partitioning the patients into 10 risk groups [13]. Shrinkage is the flattening of the plot of predicted (x-axis) and observed (y-axis) probabilities away from the 45 degree line, caused by overfitting. The estimate of slope was based on the mean slope from 200 bootstrap samples of the data. An adjusted slope close to one indicates little shrinkage and good calibration.

2.12. Statistical software 

All statistical analyses were carried out with S-Plus 2000 using the Hmisc, Design, and MICE libraries. MICE is available from www.multiple-imputation.com.

3. Results 


3.1. Long-term outcome 

Follow-up information was available on all 1,189 patients, but some were lost to follow-up before the end of the study period. Eight hundred forty-two (70.8%) patients had died at the time of censoring the data. Median follow-up in the 347 (29.2%) patients who were not known to have died was 1,665 days (range 29–5,852 days). Five-year survival in the cohort was 29.6% (95% CI: 26.8 to 32.5%), in keeping with international mortality rates.

3.2. Missing data 

The potential prognostic factors are summarized in Table 1. Performance status levels 3 and 4 were combined because of small numbers. CA125 and alkaline phosphatase laboratory results were skewed, so a summary of both variables transformed using natural logs is also presented. Data on age and histology were complete for all patients. Performance status, CA125, alkaline phosphatase, and albumin were missing for 43.0, 37.0, 33.3, and 33.0% of patients, respectively. Of the 11,890 prognostic factor data values, 2,045 (17.2%) were missing in 831 (69.9%) patients, representing fewer than two missing values per individual in the entire cohort. Two hundred thirty-six (19.8%) patients had four or more missing values, but only 4 (0.4%) patients had seven or more missing prognostic factors. One thousand seven hundred thirty-nine (85.0%) of the missing cells resulted from missing information on albumin, alkaline phosphatase, CA125, and performance status. The number of patients contributing to a complete case analysis using all the prognostic factors in Table 1 would be 358 (245 deaths).

Plots of the proportion of missing data by diagnosis year show that the proportions for ascites, alkaline phosphatase, albumin, grade, and residual disease were constant. In contrast, the proportion of missing CA125 data decreased linearly in time from 85 to 21% between 1984 and 1999, and the proportion of missing performance status had an increasing trend in time with a minimum of 18% in 1986 and a maximum of 71% in 1995. These trends can be explained by the fact that CA125 was not commonly reported in the literature until the mid-1980s, and that there has been a decrease in prominence of performance status as a prognostic factor in the ovarian cancer literature.

3.3. Evidence of MAR data 

An analysis of the survival distributions of non-missing and missing strata within each of the factors (log) CA125, grade, FIGO stage, and performance status (Fig. 1), showed no visual or statistical evidence of significant differences (Table 2). However, there was a difference between the survival distributions of patients with and without missing data for ascites (P < .002) (Fig. 1), albumin (P < .003), alkaline phosphatase (P = .020) (Fig. 1), and residual disease (P = .020) (Fig. 1). Those patients missing albumin and alkaline phosphatase results had a better prognosis, suggesting that eliminating the patients with missing values would lead to an underestimate of the true survival of the cohort. The opposite effect was seen for ascites and residual disease. Those patients missing ascites data may be more likely to have ascites present at diagnosis. However, as the survival curves are not adjusted for other variables, these effects may not be real and, of course, in theory one might miss an effect if masked by some other association.



Fig. 1. Differences in survival of patients with and without recorded performance status, ascites, residual disease, and alkaline phosphatase, tested using log rank methods.


Table 2.

Associations between missingness and other potential prognostic factors, auxiliary variables and survival time

Associated with (columns, in order): prognostic variables (Age, FIGO, Grade, Histology, Ascites, Perf. Stat., Residual disease, Log CA125, Log ALKP, Albumin); auxiliary variables (Surgery, Clinical Trial, Chemo.); Survival Time (a)

| Missingness of | Associations |
|---|---|
| FIGO stage | XX |
| Grade | X XXXXX |
| Ascites | X XXX<< |
| Performance status | XXXXX XXXXX |
| Residual disease | XXXX X<< |
| Log CA125 | XXXXX XXX |
| Log ALKP | XXXXXXX XXX>> |
| Albumin | XXXXXXX XXX>> |

(a) Based on a log rank test comparing survival distributions in missing and nonmissing groups.

X = an association found using univariate logistic regression (P<.05); << = those with missing data had a significantly worse survival; >> = those with missing data had a significantly better survival; a dash indicates no association.

The univariate logistic models indicated that the missingness of potential prognostic factors was (potentially) associated with other potential prognostic factors and auxiliary variables. Histology and clinical trial participation were associated with the missingness of all but one prognostic variable (Table 2). These results must be interpreted with the same caution because they were from univariate analyses.

3.4. Imputed data 

We completed 10 datasets by imputing 2,045 values in each. As a consequence, 6,265 additional real data values were incorporated into each dataset. The summary statistics for the datasets with imputed data are shown in Table 3. The narrow ranges of imputation values for each potential prognostic variable coincide with the visual impression that the distributions for each of the potential prognostic factors in the 10 imputed datasets were similar. The prevalences (%) of categorical prognostic factors in the original data (ignoring missing data) were consistent with those from the 10 imputations. The median and range of albumin, log CA125, and alkaline phosphatase in the original data were consistent with the median of the medians of the 10 imputation distributions, and the extreme values of these distributions, respectively.
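The count of 6,265 can be reproduced from figures reported in Section 3.2, assuming it refers to the observed prognostic factor values of the 831 patients who had at least one missing value (and who would therefore be dropped from a complete case analysis of all 10 factors). This reading is an assumption; the derivation is not spelled out in the text.

```latex
\begin{align*}
1{,}189 \times 10 &= 11{,}890 && \text{prognostic factor values in total} \\
2{,}045 / 11{,}890 &\approx 0.172 && \text{proportion missing} \\
831 \times 10 - 2{,}045 &= 6{,}265 && \text{observed values in incomplete records} \\
6{,}265 / 2{,}045 &\approx 3.06 && \text{real values gained per imputed value}
\end{align*}
```

The last ratio corresponds to the statement that about three real values were included for each one imputed.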

Table 3.

Results of the missing data imputation compared with original data

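| Prognostic factor | Original: # or (median) | Original: % or (range) | Completed (a): median | Completed (a): range | Completed (a): overall % |
|---|---|---|---|---|---|
| FIGO stage | | | | | |
| I | 281 | 24.1 | 286 | 284–289 | 24.1 |
| II | 119 | 10.2 | 122 | 120–124 | 10.2 |
| III | 590 | 50.5 | 599 | 599–602 | 50.4 |
| IV | 178 | 15.2 | 183 | 179–186 | 15.3 |
| Unknown | 21 | | 0 | | |
| Grade | | | | | |
| I | 131 | 12.5 | 149 | 144–153 | 12.5 |
| II | 278 | 26.5 | 315 | 310–321 | 26.5 |
| III | 641 | 61.0 | 724 | 716–732 | 60.9 |
| Unknown | 139 | | 0 | | |
| Ascites | | | | | |
| Presence | 707 | 62.9 | 750 | 747–752 | 63.0 |
| Absence | 417 | 37.1 | 440 | 437–442 | 37.0 |
| Unknown | 65 | | 0 | | |
| Perf. Stat. | | | | | |
| 0 | 328 | 48.4 | 581 | 569–590 | 48.9 |
| 1 | 228 | 33.6 | 369 | 363–382 | 31.1 |
| 2 | 84 | 12.4 | 151 | 141–158 | 12.6 |
| 3+4 | 38 | 5.6 | 87 | 81–104 | 7.4 |
| Unknown | 511 | | 0 | | |
| Res. Disease | | | | | |
| <2 cm | 641 | 57.9 | 681 | 676–687 | 57.3 |
| 2–5 cm | 165 | 14.9 | 181 | 177–186 | 15.2 |
| >5 cm | 302 | 27.3 | 326 | 323–333 | 27.5 |
| Unknown | 81 | | 0 | | |
| Log CA125 | (5.34) | (1.79–10.04) | 5.16 | 1.79–10.04 | |
| Albumin | (39.0) | (20.0–50.0) | 39.0 | 20.0–50.0 | |
| Log Alk. Phos. | (4.54) | (3.26–7.50) | 4.54 | 3.26–7.50 | |

(a) Ten datasets with original data augmented by imputed missing values.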

3.5. Univariate survival analysis 

Univariate analyses suggested that age, FIGO stage, the presence or absence of ascites, performance status, histology, residual disease, albumin, grade, and (log) CA125 and alkaline phosphatase were significant (P < .05) prognostic factors of overall survival (Table 1). Investigations of linearity of CA125, alkaline phosphatase, age, and albumin in the univariate models suggested that CA125 and alkaline phosphatase required a natural logarithmic transformation.

3.6. Cox models 

Table 4 shows the results from fitting the Cox models. The first column represents an analysis of complete cases in the original data (Model 1). As four factors, each with missing values, were found not to be prognostic, the analysable dataset was 518 (380 deaths), that is, larger than the 358 patients available when fitting the full model. The second column represents a pooled analysis using 10 complete datasets with imputed missing values (Model 2). Age, FIGO stage, performance status, residual disease, and (log) alkaline phosphatase were significant (P < .05) prognostic factors in both models. Specifically, there was a greater risk of mortality associated with older age, higher FIGO stage, the presence of ascites, worse performance status, and greater residual disease and alkaline phosphatase levels. In both models, the histology comparison between the “Serous” and “Mixed Mesodermal” groups was significant (P < .02). Using a likelihood ratio test the overall histology effects in each model were significant (P < .001). There was a greater number of statistically significant differences (P < .05) between “Mixed Mesodermal” and the other histologies in Model 2 than in Model 1. Grade and ascites were statistically significant in Model 2, but not in Model 1. Log CA125 (P = .95) and albumin (P = .06) were not statistically significant prognostic variables in either model and were therefore not included in Table 4. It is worth noting that a complete case analysis based on Model 2 would include only 449 patients (319 deaths).

Table 4.

Prognostic models for overall survival

Multivariate models

| Predictors | Model 1 (a) (n = 518, # deaths = 380): HR (95% CI) | Model 2 (b) (n = 1189, # deaths = 842): HR (95% CI) |
|---|---|---|
| Age (years) | 1.013 (1.002,1.023) | 1.022 (1.015,1.030) |
| FIGO stage | | |
| I | 1.000 | 1.000 |
| II | 2.227 (1.297,3.823) | 1.842 (1.335,2.541) |
| III | 4.154 (2.580,6.689) | 3.207 (2.430,4.232) |
| IV | 4.477 (2.607,7.687) | 3.035 (2.127,4.330) |
| Grade | | |
| I | | 1.000 |
| II | | 1.506 (1.075,2.110) |
| III | | 1.621 (1.164,2.258) |
| Histology | | |
| Serous papillary | 1.000 | 1.000 |
| Endometrioid | 0.698 (0.520,0.936) | 0.827 (0.671,1.020) |
| Mucinous | 0.873 (0.554,1.375) | 0.932 (0.683,1.272) |
| Clear cell | 1.372 (0.933,2.018) | 1.357 (1.027,1.795) |
| Mixed mesodermal | 2.540 (1.404,4.596) | 2.095 (1.446,3.036) |
| Adenocarcinoma | 1.361 (0.594,3.115) | 1.273 (0.834,1.941) |
| Undifferentiated | 1.649 (0.911,2.986) | 1.570 (0.968,2.547) |
| Ascites | | |
| Absence | | 1.000 |
| Presence | | 1.395 (1.181,1.648) |
| Perf. status | | |
| 0 | 1.000 | 1.000 |
| 1 | 1.183 (0.924,1.515) | 1.144 (0.942,1.388) |
| 2 | 1.490 (1.044,2.127) | 1.535 (1.198,1.967) |
| 3+4 | 3.309 (1.959,5.591) | 3.977 (2.518,6.281) |
| Residual disease | | |
| >5 cm | 1.000 | 1.000 |
| 2–5 cm | 0.903 (0.669,1.219) | 0.900 (0.721,1.123) |
| <2 cm | 0.543 (0.410,0.720) | 0.511 (0.417,0.627) |
| Log alkaline phos. | 1.732 (1.324,2.266) | 1.637 (1.238,2.166) |

(a) Model 1 was constructed using complete cases in the original dataset.

(b) Model 2 was constructed using 10 datasets with original values augmented by imputed missing data.

The confidence limits are narrower in the augmented data, especially for those factors with fewer missing observations in the original dataset, which is a direct result of including three additional actual values for each imputed value.

There was no evidence of any violation of the proportional hazard assumptions in any model.

3.7. Measures of model performance 

Table 5 shows the measures of predictive accuracy for Model 1 and Model 2 applied to the original data and completed data (i.e., original data augmented with imputed missing values). Models applied to the completed data had slope shrinkage values close to 1, indicating less overfitting and less need for recalibration of prognostic factor effects. In addition, the models applied to completed data had substantially better measures of discrimination.

Table 5.

Measures of predictive accuracy for Models 1 and 2

Multivariate models

| Measure | Complete case analysis: Model 1 (n = 518) | Complete case analysis: Model 2 (n = 454) (a) | Completed data (b): Model 1 (n = 1189), median (range) (c) | Completed data (b): Model 2 (n = 1189), median (range) (c) |
|---|---|---|---|---|
| Slope | 0.864 | 0.830 | 0.951 (0.941–0.954) | 0.947 (0.939–0.958) |
| c-index | 0.751 | 0.745 | 0.783 (0.779–0.786) | 0.785 (0.780–0.787) |
| $R^2_N$ | 0.411 | 0.402 | 0.499 (0.486–0.512) | 0.510 (0.498–0.524) |

See Table 4 for composition.

(a) 326 deaths.

(b) Original complete cases augmented by imputed missing data.

(c) Median and range from the 10 analyses.

Model 1 was marginally inferior to Model 2 in overall discrimination when applied to the completed data. However, high values of the c-index (∼0.78) and Nagelkerke $R^2$ (∼0.50) for both models applied to the completed data indicate that their respective sets of prognostic factors were explaining the variation in outcome reasonably well.

If we compare the discrimination between Model 1 and Model 2 when applied to the datasets used in their construction, there is a clear difference. Figure 2 (left) shows this difference. Four risk groups were created for each of Model 1 and Model 2 by partitioning linear predictions from the respective models into quartiles. The plot shows there is greater separation between risk groups using Model 2, particularly at lower survival times. The differences are a result of increased sample size and two extra prognostic factors. Figure 2 (right) shows the effect of increased sample size only, with the same risk groups from Model 2 as before, applied to all patients and to only those with complete data for Model 2. This plot shows that a complete case analysis using Model 2 may overestimate the survival in both risk group 2 after 3 years and risk group 4 in the first year. In both plots, the confidence intervals around the survival curves are narrower in the augmented data.



Fig. 2. Risk groups and their survival using Models 1 and 2 (see Table 4 for composition).
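The risk groups plotted in Fig. 2 were formed by partitioning each model's linear predictions into quartiles. A minimal sketch of that partitioning step is given below, assuming NumPy is available; the function name `quartile_risk_groups` and the example values are hypothetical, and the handling of values that fall exactly on a cut point is our own choice rather than something stated in the paper.

```python
import numpy as np


def quartile_risk_groups(linear_predictor):
    """Assign each patient to risk group 1-4 by quartile of the linear predictor."""
    lp = np.asarray(linear_predictor, dtype=float)
    # Cut points at the 25th, 50th, and 75th percentiles of the predictions.
    cuts = np.quantile(lp, [0.25, 0.5, 0.75])
    # digitize returns 0..3 (values equal to a cut fall in the lower group);
    # add 1 so that group 1 is the lowest predicted risk and group 4 the highest.
    return np.digitize(lp, cuts, right=True) + 1


# Hypothetical linear predictions for eight patients.
example_groups = quartile_risk_groups([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

Survival curves such as those in Fig. 2 would then be estimated separately within each group.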


4. Discussion 

Most prognostic studies are performed retrospectively. The construction of prognostic models is problematic when data are missing on one or more prognostic variables. Although missing data are very common, they are infrequently discussed. The standard approach is to exclude those individuals for whom data are missing from the analysis. In addition to being a waste of data that are often costly to collect, this practice could lead to invalid results if the excluded group is a selective subsample (nonrandom sample) from the entire sample with respect to outcome.

We developed a prognostic model for overall survival in 1,189 ovarian cancer patients. An analysis based on 518 patients with complete prognostic factor data found that age, FIGO stage, histology, performance status, residual disease, and alkaline phosphatase were significant (P < .05) prognostic factors for overall survival (Model 1). We applied Bayesian simulation to construct 10 complete datasets for all 1,189 patients. Within a multiple imputation (MI) framework we analyzed these individually and pooled the results, adjusting standard errors and other statistics for the uncertainty consequent upon imputing the missing data. By imputing missing data we added three actual data values for each imputed value, and more than doubled our sample size (from 518 to 1,189). With the increased statistical power, grade and ascites were additionally found to be significant (Model 2). The hazard ratios of the prognostic factors common to both models were similar, except that FIGO stage had reduced effect sizes and a weaker linear trend in the augmented data.
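The pooling step described above can be sketched as follows, assuming the standard combining rules for multiple imputation (Rubin's rules), in which the pooled variance is the average within-imputation variance plus an inflated between-imputation variance. The function name `pool_rubin` and the numerical inputs are hypothetical; they are not estimates from this study.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets (illustrative sketch).

    estimates -- the coefficient (e.g., a log hazard ratio) from each dataset
    variances -- the squared standard error of that coefficient in each dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputed datasets")
    pooled = sum(estimates) / m                    # average of the m estimates
    within = sum(variances) / m                    # within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between     # total variance reflects imputation uncertainty
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios and variances from two imputed datasets.
pooled_estimate, pooled_se = pool_rubin([0.5, 0.7], [0.04, 0.04])
```

The between-imputation term is what widens the standard error for variables with a large proportion of imputed values, relative to treating the imputed values as if they had been observed.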

Additionally, we found that the models applied to completed data (i.e., the 10 datasets with imputed missing values) had better calibration (i.e., greater ability to produce unbiased estimates of outcome), and superior discrimination (i.e., improved ability to provide accurate predictions for individual patients) than models based on complete cases only. This is a result of increased sample size and possibly less bias. There was little difference between the discrimination measures of Model 1 and Model 2 when applied to the completed data. An explanation for the small difference is the focus of this discussion. In particular, given the evidence of only a small difference in predictive ability between Models 1 and 2, was the application of the MI procedure necessary? The answer to this question lies in a discussion of the assumptions and limitations of the statistical methodology.

The major assumption associated with the complete case construction of Model 1 is that the reason for missingness is not related to outcome. If we assume the prognostic model is correct, bias will not arise if the reason for missingness of a variable is associated with other variables in the model or the true values of the missing prognostic variable itself. However, the deletion of incomplete cases in Model 1 (671 of 1,189 patients excluded because of at least one missing prognostic factor value) will result in less precision. There is evidence, however, that the missingness may be related to outcome. In particular, univariate survival analyses indicated that those patients missing albumin, alkaline phosphatase, residual disease, and ascites had either better or worse survival than those with recorded values. However, only adjusted survival analyses could determine whether the associations were real, but these were not possible because of the multivariate nature of the missingness.

Model 2 assumes that the missingness mechanism is missing at random (MAR), that is, the missingness is not related to the true (unobserved) values of the missing data. A complete case analysis does not make this assumption. However, MAR also assumes that the missingness of a variable depends on the values of variables that we collected data on. This implies that the imputation approach may produce valid inferences in the presence of response-related missingness. Our imputation model for each variable consisted of all other prognostic factors, auxiliary variables (not incorporated into Model 1) and survival information. The survival time and censoring indicator were incorporated into the imputation model because there was evidence that missingness was related to survival. The somewhat circular approach allows us to include all patients. As a general rule, using all available information yields multiple imputations that have minimal bias and maximal certainty [9]. This principle implies that the number of predictors should be as large as possible. In addition, and most importantly, it has been observed that including many predictors in the imputation model makes the MAR assumptions more plausible, and reduces the need to make special adjustments for missingness associated with the unobserved true values of missing data [9,14]. Ascites was present in Model 2 and not Model 1, and this may not be entirely a result of increasing sample size, but may also result from reducing bias in the estimation of its effect. However, ascites is a relatively weak prognostic factor and its inclusion has little effect on the predictive ability.

An appealing feature of the application of MI is that it yields actual imputed values, and potentially allows scope for investigating the properties of the resulting prognostic models. Table 3 shows that the imputed data are consistent with the original data, but this need not be true in other applications. The MI process has incorporated the uncertainty of imputing those variables with greater proportions of missing data. Given that using our method allows us to include three extra real data values for each imputed one, the confidence limits have narrowed in most cases (Table 4), less in variables with high levels of missing data (e.g., performance status), and more in those with low levels of missingness (e.g., FIGO stage).

Most data are multivariate in nature, so a small proportion of missing data for several variables can lead to a severely depleted complete case analysis. MI seems appropriate in this setting if the original dataset is not too small. The similarity of the predictive abilities of Model 1 and Model 2 may be explained by the fact that there are no major deviations away from their respective modeling assumptions. This does not imply that the complete case analysis was appropriate in the first place. Using imputed data we are incorporating patients who would otherwise be removed merely because one or more of their prognostic factors are missing and, as a result, increasing power and adding precision to an analysis (see Fig. 2). Of course, we cannot know what the effect of imputation will be without performing the analysis. In other cases we might find a greater effect for imputation, and this may suggest a complete case analysis is not appropriate. From this point of view our approach may be viewed as a sensitivity analysis, and ultimately we need to use judgement about the plausibility of assumptions in a particular situation to assess which is the primary analysis.

Acknowledgements 

We wish to thank Moira Stewart, Hani Gabra, and John Smyth for making the data available, and Christina Davies and Victoria Cornelius for reading earlier drafts of the manuscript. Taane Clark holds a National Health Service (UK) Research Training Fellowship. Douglas Altman is supported by Cancer Research UK.

References 

1. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology. Br J Cancer. 1994;69:979–985.

2. Haybittle JL, Blamey RW, Elston CW, Johnson J, Doyle PJ, Campbell FC, et al. A prognostic index in primary breast cancer. Br J Cancer. 1982;45:361–366.

3. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol. 1995;48:1503–1510.

4. CRC CancerStats. http://www.crc.org.uk/cancer/cancer_intro.html, Cancer Research Campaign.

5. Clark TG, Stewart ML, Altman DG, Gabra H, Smyth JL. A prognostic model for ovarian cancer. Br J Cancer. 2001;85:944–952.

6. Schafer JL. Multiple imputation (a primer). Stat Methods Med Res. 1999;8:3–15.

7. Vach W. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med. 1997;16:57–72.

8. Schafer JL. Analysis of incomplete multivariate data. London: Chapman & Hall; 1997.

9. Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18:681–694.

10. Gilks WR, Richardson S, Spiegelhalter DJ. Markov chain Monte Carlo in practice. London: Chapman & Hall; 1996.

11. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates (parsimonious parametric modelling). Appl Stat. 1994;43:429–467.

12. Hosmer DW, Lemeshow S. Applied survival analysis (regression modeling of time to event data). New York: Wiley; 1999.

13. Harrell FE. Multivariate prognostic models (issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors). Stat Med. 1996;15:361–387.

14. Rubin DB, Stern HS, Vehovar V. Handling Don't Know survey responses (The case of the Slovenian plebiscite). J Am Stat Assoc. 1995;90:822–828.

Centre for Statistics in Medicine, Institute of Health Sciences, University of Oxford, Old Road, Oxford OX3 7LF, United Kingdom

Corresponding author. Tel: +44 (0)1865 226874; fax: +44 (0)1865 226962.

PII: S0895-4356(02)00539-5

