A method making fewer assumptions gave the most reliable estimates of exposure–outcome associations in stratified case–cohort studies

Objective A case–cohort study is an efficient epidemiological study design for estimating exposure–outcome associations. When sampling of the subcohort is stratified, several methods of analysis are possible, but it is unclear how they compare. Our objective was to compare five analysis methods using Cox regression for this type of data, ranging from a crude model that ignores the stratification to a flexible one that allows nonproportional hazards and varying covariate effects across the strata. Study Design and Setting We applied the five methods to estimate the association between physical activity and incident type 2 diabetes using data from a stratified case–cohort study and also used artificial data sets to exemplify circumstances in which they can give different results. Results In the diabetes study, all methods except the method that ignores the stratification gave similar results for the hazard ratio associated with physical activity. In the artificial data sets, the more flexible methods were shown to be necessary when certain assumptions of the simpler models failed. The most flexible method gave reliable results for all the artificial data sets. Conclusion The most flexible method is computationally straightforward, and appropriate whether or not key assumptions made by the simpler models are valid.


Introduction
A caseecohort study is nested within a prospective cohort study and is an efficient design because full covariate data are only gathered on the cases (participants who have an event during the follow-up period) and the subcohort. The subcohort is a randomly selected subset of the full cohort at baseline and therefore includes some future incident cases. The proportion of the cohort selected to be in the subcohort is called the sampling fraction. Obtaining data on the full cohort can be expensive, for example if biomarkers or DNA from blood samples are required, so the caseecohort design is more cost-effective than the full cohort study, especially when the event of interest is rare. Another advantage of the caseecohort design is that the same subcohort can be used to study multiple disease outcomes, though if exposures corresponding to different case definitions are measured at different times then batch effects might be a problem.
The selection of the subcohort is sometimes stratified on one or more covariates that are available in all cohort members, to improve efficiency or achieve a similar distribution of these covariates between the cases and the subcohort. The cohort is divided into strata, each with a potentially different sampling fraction for selecting participants to be in the subcohort. For data from this type of study, Borgan et al. [1] described how to fit a Cox proportional hazards model, for estimating exposureeoutcome associations, in a way that aims to account for the stratified design. Recent What is new?

Key findings
This article compares five ways of analyzing stratified caseecohort studies, using data from the EPIC-InterAct study and artificial data sets. A two-stage Cox model with random-effect metaanalysis gave the most reliable results in the widest variety of scenarios and is a flexible model that makes fewer assumptions than the other models investigated.
Three different estimators to account for the stratified caseecohort design were investigated in combination with the proposed models. All estimators performed poorly when the stratification was not incorporated in the model.

What this adds to what was known
This article provides a detailed comparison and discussion of different analysis methods for stratified caseecohort studies, which is lacking in the existing literature. It presents the assumptions, together with the advantages and disadvantages of each proposed model, and makes recommendations for epidemiologists and applied statisticians who want to analyze data from studies with this design.

What is the implication and what should change now?
A flexible two-stage Cox model with random-effect meta-analysis should be routinely considered for analyzing stratified caseecohort studies where the strata may be confounding the exposureeoutcome relationship. Combined with a Prentice estimator, this approach provides reliable results and is generally recommended. An exception may be when there are very few events or strata that may lead to difficulties in fitting the two-stage random-effect model.
articles [2,3] mentioned only this method, but there are several other possibilities. In this article we describe five models, apply them to a caseecohort study, compare their performance using artificial data, and discuss their advantages and disadvantages. We first describe approaches used to fit a Cox model to caseecohort data from an unstratified study, since these form the basis of the five methods for stratified studies, which we describe next. We apply the methods for stratified studies to data from the EPIC-InterAct study [4] (hereafter known as InterAct), which is a stratified caseecohort study of incident type 2 diabetes in 26 centers across Europe. We then use artificial data sets to show how the five methods can give different results. Finally, we discuss these findings and make recommendations for researchers analyzing data from stratified caseecohort studies.

Methods for unstratified caseecohort studies
We first review the use of Cox regression models [5] to analyze data from an unstratified caseecohort study, where the subcohort has been selected by simple random sampling (each participant has equal probability of being in the subcohort). If participant i has exposure of interest x i with associated log hazard ratio b, and confounder z i with log hazard ratio g, then the model for the hazard function is The baseline hazard h 0 ðtÞ is common to all participants. If there is more than one confounder, then g and z i are vectors and gz i is replaced by g T z i .
In a full cohort study, the partial likelihood function would be maximized to estimate b [5]. In a caseecohort study, this is not possible, because x i and z i are only known for the cases and members of the subcohort. The partial likelihood function is therefore approximated by a pseudolikelihood, of which several versions have been proposed that correspond to slightly different estimators. Simulations have shown that the estimator of Prentice [6] leads to estimates of b and its standard error that most closely resemble those that would have been obtained from the corresponding full cohort study [7]. Prentice's estimator also has desirable asymptotic properties [8]. One alternative is the estimator of Barlow [9], which, unlike Prentice's, directly involves the sampling fraction and approximates the likelihood from the full cohort study. These two estimators are explained in detail in Appendix A, in the supplementary material at www.jclinepi.com.
Kim [10] has recently carried out a more thorough comparison of estimators for caseecohort data using simulated data sets. He found that inverse probability weighting estimators are the most powerful, but overall the differences between estimators are small when the proportion of cases in the full cohort is less than 10% and negligible when it is less than 1%. The differences also decrease as the size of the data set (the number of observations) increases.
For caseecohort data, the asymptotic theory for variances in the standard Cox model does not hold [6], and it is necessary to use ''robust'' variance estimates [9,11,12].
The assumption of proportional hazards can be checked by adding to the model an interaction between a baseline covariate and the underlying timescale, or by using an adaptation of Schoenfeld residuals [13].

Methods for stratified caseecohort studies
We now describe methods for analyzing data from a stratified caseecohort study. Each stratum has its own sampling fraction, and within a stratum each participant has equal probability of being selected to be in the subcohort. For ease of exposition we will refer to the strata as centers, since this is a common form of stratification in practice, but the strata can also be defined in terms of one or more other covariates, for example a correlate of the exposure (see the last paragraph of the discussion).
Suppose that participant i in center j has exposure x ij and confounder(s) z ij , and the log hazard ratios corresponding to the exposure and confounder(s) are b and g respectively. We describe five Cox regression models of increasing complexity for exposureeoutcome associations in a stratified caseecohort study and methods for estimating their parameters.
Model I Single Cox model, with center not included in the model: The parameters of this model can be estimated using the estimators of (a) Prentice, (b) Barlow, or (c) Borgan (estimator III in [1]). Estimator (c) is specially intended for the purpose of fitting an unstratified Cox model to data from a stratified caseecohort study. Estimators (a) and (b) are not expected to be appropriate, whereas (c) is formally correct but computationally complex and has only recently become available in standard software. For (c), the usual robust variance estimator is not valid, so an asymptotic estimator is used instead [11].

Model II
Single unstratified Cox model, including center as a categorical covariate: This model assumes proportional hazards between the different centers, with d j representing the log hazard ratio for center j relative to center 1 (so that d 1 50). The three estimators used for Model I can also be used for Model II.

Model III
Single Cox model with the baseline hazard stratified by center: This model does not assume proportional hazards between the centers. Instead it gives each center j its own baseline hazard h 0j ðtÞ. Prentice's and Barlow's estimators can be used. The pseudolikelihood for Barlow's estimator is similar to the pseudolikelihood for Borgan's estimator, so in this article we use Barlow's estimator with Models I and II to compare it with Borgan's.

Model IV
Separate Cox model for each center, assuming a common log hazard ratio for the exposure in all centers: This model allows the confounder effects to vary between centers, as represented by the parameters g j . The data can be analyzed by a two-stage method: first fit a separate Cox model to each center, using Prentice's estimator; then combine the estimates of b using fixed-effect metaanalysis. This model could also be fitted by a one-stage method, with a single pseudolikelihood for the whole data set, by including interaction terms between the center variables and the confounder(s).
Model V Separate Cox model for each center, with each center having its own log hazard ratio for the exposure, and these log hazard ratios following a normal distribution: Here t 2 is the between-study variance of the true centerspecific log hazard ratios. This extends Model IV to allow the exposureeoutcome association to vary between centers; b is now the average log hazard ratio. The model parameters can be estimated using the same two-stage approach as for Model IV but with random-effect (instead of fixedeffect) meta-analysis. In our analyses we used a moment estimator of t 2 [14]. Model V could also be fitted as a one-stage random-effects model, but this is computationally difficult for survival models and results have been shown to be very similar between one-stage and twostage models [15]. The assumptions each model makes about the exposure effect, center effects, and confounder effect(s) are summarized in Table 1, and software functions for them are described in Appendix B.

Application of the models to InterAct
InterAct is a multicenter caseecohort study of incident type 2 diabetes nested within the EPIC-Europe cohort (EPIC is the European Prospective Investigation of Cancer and Nutrition) [4,16]. The cohort from which the InterAct cases and subcohort members were sampled consisted of 340,234 participants, recruited from 1993 to 1999; this was smaller than the full EPIC-Europe cohort because blood samples were unavailable for some EPIC-Europe participants and because two of its countries did not participate in InterAct. The subcohort selection for InterAct was stratified by center, except in France, which was a single stratum (due to the small number of participants in each center in that country). We therefore regard France as a single center, so there are 21 centers in our analysis. In each center, the sampling fraction was chosen to be approximately 0.5e1% higher than the estimated baseline prevalence of type 2 diabetes in that center's population.
We estimate the association between self-reported physical activity and incident type 2 diabetes using data from InterAct and the five models previously described. This association has already been investigated and reported elsewhere [17]; here the same data are used to explore how the five different methods for stratified casee cohort data perform. Table 2 provides an overview of InterAct. Sampling fractions ranged from 2.6% to 10.5% across the centers. Physical activity was self-reported as 1 (inactive), 2 (moderately inactive), 3 (moderately active), or 4 (active) and included in the models as a continuous variable. The only confounder included was sex. In all models we used age in years as the timescale, as recommended for observational studies [18]. Borgan's estimator assumes there are no tied events, so tied event-times were changed by up to 0.01 days. Deaths from any cause were regarded as censored observations.
The results of applying Models IeV to the InterAct data are shown in Fig. 1. Greater physical activity is associated with a lower risk of diabetes. For most of the models and estimators the hazard ratios for physical activity are very similar. The only exceptions are the Barlow and Borgan estimators with Model I, which ignores center effects. The slightly lower estimates for Models IV and V indicate possible differential confounding by sex across centers. The wider confidence interval for Model V reflects the fact that there is heterogeneity in the hazard ratio across centers. For the methods that use the Prentice estimator, the assumption of proportional hazards was tested by adding to the models an interaction between physical activity and the underlying timescale (age). In Models IV and V, meta-analyses of the interaction parameter estimates across centers were then performed [19]. The interaction terms were not statistically significant (P O 0.5 in all models). A test of proportional hazards between the 21 centers in Model II yielded c 2 20 5 159.0 (P ! 0.001), suggesting that Models IIIeV should be preferable.
The published investigation of physical activity and incident type 2 diabetes [17] used Model V and found that the hazard ratio for a one-category increase in physical activity was 0.87 in men (with 95% confidence interval 0.80, 0.94) and 0.93 in women (0.89, 0.98). As well as analyzing the sexes separately, these analyses used more covariates and a different grouping of the centers, so the results are not directly comparable with those reported here.

Comparisons of the models using artificial data sets
Given the general similarity of the results for Models IeV found above, we next used artificial data sets to demonstrate circumstances in which the five models give different estimates of an exposureeoutcome association. For each consecutive pair of models, we created an artificial data set from the more complex of the two models and then estimated the log hazard ratio using both models. A model was deemed to have estimated the hazard ratio accurately if the 95% confidence interval contained the true value. A set of four ''realistic'' data sets was created with specifications (size, sampling fractions, effect sizes, etc.) similar to the InterAct study. A set of four smaller ''extreme'' data sets was also created to show clearly the differences between the models.
We used a normally distributed exposure and a single binary-valued confounder. As above the strata for subcohort selection are referred to as ''centers.'' As far as possible, the same specifications were used for multiple data sets. The specifications are shown in Table 3 and computer code for generating the data is given in Appendix B. For Models I and II, we used only estimators (a) and (c), and for all data sets we also used Model V.
For all data sets the true value of the exposure hazard ratio was 1.5. The data sets for Models IV/V had different exposure hazard ratios in each center, but these were chosen so that the mean of the b j 's (see section 3) was log 1:5. Fig. 2 shows the estimated hazard ratios per unit increase in the exposure, and 95% confidence intervals, obtained from fitting the relevant pair of models to each data set. For the extreme data sets, the confidence intervals from the simpler models do not contain the true value of 1.5, whereas those from the more complex models do. For Models I/II and IV/V, the same is true with the realistic data sets. In all cases Model V appeared to provide a reliable analysis.
For the data sets for Models I/II, the centers with higher risks also had higher average exposure values. Hence center and exposure were confounded, and Model I's failure to allow for center effects led to incorrect results. For the data sets for Models II/III, nonproportional hazards across centers were achieved by using Weibull distributions with different shape parameters. Centers with higher risks also had higher average exposure values. Model II assumes proportional hazards across centers and so can give incorrect results as shown by the ''extreme'' example. For Models III/IV, the centers with higher risks and higher average exposure values also had lower hazard ratios for the confounder. Not allowing for the varying confounding effect across centers in Model III led to incorrect results with the ''extreme'' data set. For Models IV/V, larger centers had higher exposure hazard ratios. Larger centers are given more weight in fixed-effect than in random-effect metaanalysis, so Model IV gave higher hazard ratio estimates than Model V. The heterogeneity in the exposure hazard ratios also led to wide confidence intervals with Model V.
To investigate other aspects of the estimators' performance we performed a simulation study, simulating 200 data sets for each of the ''realistic'' specifications from Table 3 (a larger simulation study was impractical for computational reasons). For Models I and II we used only Prentice's estimator, because Borgan's estimator would be too time-consuming to calculate for this many data sets.   Wb(a,b) is the Weibull distribution with scale a and shape b, which has hazard function hðtÞ5ðb=aÞðt=aÞ bÀ1 ; Cx is ''center x''; n/a is ''not applicable.'' As above, we fitted Model V to all the data sets, partly to see whether it was much less efficient than the simpler models. For each model we recorded the coverage rate, mean bias, and mean squared error of the exposure log hazard ratio, and for the more complex models we recorded the mean of the relative standard error compared to the simpler modeldthe larger this ratio, the less efficient the more complex model is. For 200 simulations, the standard error of the estimated coverage is approximately 0.015, which is small enough to show some of the differences between the models.
The results are shown in Table C.1 in Appendix C. According to the mean relative standard errors, the more complex models were mostly only slightly less efficient than the simpler ones, with the exception that Model V was much less efficient than Model IV when applied to the data sets for Models IV/V. However, Model V had less bias and thus much lower mean squared error than Model IV. Overall, in terms of coverage, mean bias, and mean squared error, the more complex models and Model V performed either very similarly to the simpler models or better than them, and Model V performed far better than Model IV.

Discussion
We have described five models for the analysis of data from a stratified caseecohort study, applied them to a study of physical activity and diabetes incidence, and used artificial data sets to investigate circumstances in which they can give different results. Model I, which took no account of the stratification, gave unreliable results in both the diabetes study and the artificial examples, regardless of which estimator was used. Even the Borgan estimator gave incorrect results with Model I in our examples. This suggests that the statistical model has to explicitly incorporate the stratification. It is also necessary to consider assumptions such as proportional hazards and homogeneous covariate effects across strata, as these may fail to hold in particular studies. However, we found that Model V, the most complex model, gave reliable results whether or not these assumptions held. Thus it is a method that can be generally recommended. All the models described are simple to fit with standard software (see Appendix B).
The main drawback of Model V is that if some strata have very few events, then fitting a separate Cox model for each one might not be possible. If that happens then data from an appropriate grouping of strata could be analyzed using Model II or III. Another drawback of Model V is that assessing the assumption of proportional hazards is cumbersome, since it needs to be done for each stratum and then an overall assessment made by meta-analysis of the nonproportionality parameter estimates [19]. Similarly, it is cumbersome to assess the shape of association when there is a separate model for each stratum. Another possible drawback is that random-effect meta-analysis is not appropriate if there are very few strata, because the estimate of the between-strata variance will not be precise.
There are several issues that we have not addressed. For example, we have not considered missing data or whether multiple imputation would be easier to perform in some models than others. If multiple imputation was used with Model V, it would need to be done in each stratum separately, using Rubin's rules, before the meta-analysis [20]. We have only considered methods based on Cox regression, with a nonparametric baseline hazard, since these seem to be used almost exclusively in practice [21], but parametric survival models for stratified caseecohort studies could be developed. These might have particular relevance for risk prediction [22], whereas the focus of this article has been on estimating the risk association of a particular exposure.
The artificial data sets were deliberately created to show that the models can give different results in certain plausible scenarios. Our main purpose was not to investigate the statistical properties of the five models, such as bias or coverage. However, the simulation study with 200 data sets for each of the realistic data set specifications gave similar results, providing reassurance about the stability of our findings. In particular, it provided evidence that Model V performs well in terms of bias and coverage and can be almost as efficient as the simpler models.
There are alternative possibilities for both the selection of the subcohort and the stratification in the analysis model. Subcohorts can be selected by more elaborate sampling schemes, such as using a formula in terms of baseline covariates to specify the relative probability of selection for each participant [23]. If the formula gives the same probability for different participants, for example because it uses only discrete-valued covariates, then this is similar to the stratified selection discussed in this article.
In terms of analysis, Model III can be adapted by stratifying the Cox model differently from the stratification of the subcohort-selection. For InterAct, the model could be stratified by country (a coarser stratification than the subcohort-selection), or by center and sex (a finer stratification than the subcohort-selection), or just by sex. Stratifying the model more finely than the subcohort-selection might be necessary if the assumption of proportional hazards was not met for some covariates. Similarly, Models IV and V could be adapted by using different subsets of the data for the separate Cox models.
In their discussion of different types of stratification, Langholz and Jiao [11] describe two situations for stratified caseecohort studies. In the ''exposure stratified'' situation, the subcohort selection is usually stratified according to a correlate of the exposure of interest, to improve the efficiency when estimating the exposure's effect size [1], and the data are analyzed with an unstratified Cox model. In the ''confounder stratified'' situation, the strata might be population groups or centers, and the data are analyzed with a stratified Cox model that uses the same stratification, to account for differences between the strata that might otherwise lead to bias. Our five models can be used regardless of the motivation behind the study design (although Model I should be avoided if the stratification variable is confounding the exposureeoutcome association). However, we have focused more on stratifications that correspond to population groups or centers, as in InterAct and EPIC-CVD [24], which is another caseecohort study based on the EPIC-Europe cohort.

Supplementary material
The appendices for this article are provided as supplementary material at http://dx.doi.org/10.1016/j.jclinepi. 2015.04.007.