## Abstract

### Objectives

### Study Design and Setting

### Results

### Conclusion

## Keywords

**What is new?**

### Key findings

- A complete case logistic regression will give a biased estimate of the exposure odds ratio if the probability of being a complete case depends on a continuous outcome but a binary version of this outcome is used in the analysis; this bias is likely to be small unless the association between the continuous outcome and the chance of being a complete case is strong. If there is an interaction between the exposure and outcome in terms of the probability of being a complete case, there could be substantial bias in the estimate of the log odds ratio.
- If an interaction is present, including one or more auxiliary variables that are good predictors of the missing binary outcome in multiple imputation (MI) models will lead to relatively large bias reductions if these variables have high sensitivity and specificity in relation to the binary outcome; if not, the bias reductions will be small.

### What this adds to what was known?

- It is known that a complete case logistic regression will give an unbiased estimate of the exposure odds ratio if the probability of being a complete case depends on the outcome and exposure independently. We show that this does not hold when the probability of being a complete case depends on an underlying continuous outcome and a binary form of this is used for analysis.

### What is the implication and what should change now?

- If one or more good predictors of the missing outcome are available, we would recommend using MI over a complete case analysis because, in practice, it would be difficult to rule out an interaction.

## 1. Introduction

## 2. Methods

### 2.1 Linkage to general practitioner data

### 2.2 Analysis of ALSPAC data

The *mi impute chained* command was used to carry out the imputations; 100 datasets were imputed with a burn-in of 20 iterations.
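Estimates from multiply imputed datasets are usually combined with Rubin's rules. The text here does not say how the 100 sets of estimates were pooled, so the following is only a minimal sketch of that standard approach, not the authors' code. The function name `rubin_pool` and the example numbers are hypothetical.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool per-imputation estimates (e.g. log ORs) with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical log ORs and variances from three imputed datasets.
pooled, total_var = rubin_pool([0.35, 0.40, 0.38], [0.026, 0.027, 0.025])
print(f"pooled log OR = {pooled:.3f}, SE = {np.sqrt(total_var):.3f}")
```

With 100 imputations the `(1 + 1/m)` inflation factor is close to 1, so the total variance is roughly the within-imputation variance plus the between-imputation variance.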

### 2.3 Simulation study

#### 2.3.1 Simulated datasets

#### 2.3.2 Generating the missing data

- (i) The probability of the outcome being observed was only associated with the continuous outcome
- (ii) The log probability of the outcome being observed depended linearly on the exposure, continuous outcome, and their interaction (note that, henceforth, where we refer to an interaction, this is what we mean)
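The two missingness mechanisms can be illustrated with a small simulation. This is a rough sketch under assumed values, not the study's simulation code: the sample size, effect sizes, dichotomisation threshold and missingness coefficients are all made up for illustration, a log link is assumed for both mechanisms, and the names `simulate` and `log_odds_ratio` are hypothetical.

```python
import numpy as np

def simulate(n=200_000, interaction=False, seed=0):
    """Simulate exposure, binary outcome, and an 'observed' indicator.

    All coefficients are illustrative assumptions, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=n)          # binary exposure
    y_cont = 0.5 * x + rng.normal(size=n)     # underlying continuous outcome
    y_bin = (y_cont > 1.0).astype(int)        # dichotomised outcome used in analysis

    # log P(observed) is linear in the continuous outcome; mechanism (ii)
    # adds the exposure and an exposure-by-outcome interaction.
    b_x, b_xy = (-0.2, -0.3) if interaction else (0.0, 0.0)
    log_p = -1.5 + b_x * x - 0.3 * y_cont + b_xy * x * y_cont
    p_obs = np.exp(np.minimum(log_p, 0.0))    # cap so probabilities stay <= 1
    observed = rng.random(n) < p_obs
    return x, y_bin, observed

def log_odds_ratio(x, y):
    """Crude log odds ratio from the 2x2 table of exposure by binary outcome."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return float(np.log(a * d) - np.log(b * c))

for interaction in (False, True):
    x, y, obs = simulate(interaction=interaction)
    full = log_odds_ratio(x, y)               # all simulated individuals
    cc = log_odds_ratio(x[obs], y[obs])       # complete cases only
    print(f"interaction={interaction}: full-data log OR={full:.3f}, "
          f"complete-case log OR={cc:.3f}, difference={cc - full:.3f}")
```

Comparing the complete-case and full-data log odds ratios under each mechanism gives an empirical impression of the bias; the size of the difference depends entirely on the assumed coefficients.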

#### 2.3.3 Scenarios investigated

Factor 4: % missing linked data | Factor 3: Interaction between outcome and exposure with respect to probability of being observed | Factor 2: Sensitivity of GP depression |
---|---|---|
0% | No | 25% |
0% | No | 75% |
0% | Yes | 25% |
0% | Yes | 75% |
25% | Yes | 75% |

#### 2.3.4 Statistical analysis

## 3. Results

### 3.1 Bias in complete case analysis

### 3.2 Simulation study

### 3.3 Analysis of ALSPAC data

Complete data on covariates | Complete data on maternal smoking status in pregnancy | Complete data on depression status (CIS-R) | Linked GP data: Yes | Linked GP data: No | Total |
---|---|---|---|---|---|
Yes | Yes | Yes | 2,201 | 517 | 2,718 |
Yes | Yes | No | 2,923 | 1,386 | 4,309 |
Yes | No | Yes | 180 | 40 | 220 |
Yes | No | No | 280 | 135 | 415 |
No | Yes | Yes | 830 | 185 | 1,015 |
No | Yes | No | 2,196 | 989 | 3,185 |
No | No | Yes | 478 | 106 | 584 |
No | No | No | 1,472 | 648 | 2,120 |
Total | | | 10,560 | 4,006 | 14,566 |

*Abbreviations:* ALSPAC, Avon Longitudinal Study of Parents and Children; CIS-R, Revised Clinical Interview Schedule; GP, general practitioner.

#### 3.3.1 Association between ALSPAC-measured and GP-recorded depression

GP measure | Present? | CIS-R diagnosis of depression: No | CIS-R diagnosis of depression: Yes |
---|---|---|---|
Current diagnosis or symptoms or treatment | No | 3,012 (97.7%) | 199 |
Current diagnosis or symptoms or treatment | Yes | 72 | 71 (26.3%) |
Future diagnosis or symptoms or treatment | No | 2,500 (79.6%) | 126 |
Future diagnosis or symptoms or treatment | Yes | 640 | 156 (55.3%) |
Historical diagnosis or symptoms or treatment | No | 3,233 (96.2%) | 217 |
Historical diagnosis or symptoms or treatment | Yes | 127 | 64 (22.8%) |

*Abbreviations:* ALSPAC, Avon Longitudinal Study of Parents and Children; CIS-R, Revised Clinical Interview Schedule; GP, general practitioner.
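The percentages in the table appear to be the specificity (GP measure absent among those without CIS-R depression) and sensitivity (GP measure present among those with CIS-R depression) of each GP measure. That reading is an inference from the cell counts rather than something stated here. The short sketch below recomputes them from the counts; the function name is hypothetical.

```python
def sensitivity_specificity(tn, fn, fp, tp):
    """Sensitivity and specificity of a GP measure against CIS-R depression.

    tn: GP absent,  CIS-R no     fn: GP absent,  CIS-R yes
    fp: GP present, CIS-R no     tp: GP present, CIS-R yes
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Cell counts copied from the table above.
gp_measures = {
    "Current": dict(tn=3012, fn=199, fp=72, tp=71),
    "Future": dict(tn=2500, fn=126, fp=640, tp=156),
    "Historical": dict(tn=3233, fn=217, fp=127, tp=64),
}

for name, cells in gp_measures.items():
    sens, spec = sensitivity_specificity(**cells)
    print(f"{name}: sensitivity={sens:.1%}, specificity={spec:.1%}")
# Current: sensitivity=26.3%, specificity=97.7%
# Future: sensitivity=55.3%, specificity=79.6%
# Historical: sensitivity=22.8%, specificity=96.2%
```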

#### 3.3.2 Predictors of observed ALSPAC-measured depression data

*n* = 4,468). Using logistic regression and after adjusting for covariates (including the exposure, maternal smoking in pregnancy), individuals with a future depression record were less likely to have CIS-R depression data; the association was weaker with current and historical depression. This suggests that the outcome, depression, is likely to be missing not at random (MNAR) conditional on the exposure and covariates; the addition of the auxiliary variables (GP-recorded depression) should reduce this dependency of missingness on the outcome, that is, it should give a better approximation to missing at random (MAR).

Variable | Present? | Odds ratio (OR) (95% CI) | P-value |
---|---|---|---|
Historical diagnosis or symptoms or treatment | Yes | 0.88 (0.68, 1.15) | P = 0.4 |
Current diagnosis or symptoms or treatment | Yes | 0.81 (0.59, 1.11) | P = 0.2 |
Future diagnosis or symptoms or treatment | Yes | 0.76 (0.66, 0.88) | P < 0.001 |

*Abbreviations:* ALSPAC, Avon Longitudinal Study of Parents and Children; GP, general practitioner.

*P* = 0.9; and RR for interaction with future depression = 0.93 (0.77, 1.12), *P* = 0.5, when added to a binomial regression model including a restricted set of covariates (sex, mother's education, mother's age, parity, social class, and number of rooms)]. These covariates were selected on the basis of their strength of association with having observed depression data; only a restricted set of covariates could be included because models including additional covariates did not converge. Based on this, we would expect the estimate of the OR from the complete case logistic regression to be approximately unbiased if the association with the chance of having observed data depended on this binary measure of depression and not on an underlying continuous measure of depression. However, we note that, firstly, this was not the CIS-R measure of depression but a proxy, GP-recorded depression, and, secondly, the confidence intervals (CIs) for these interactions are quite wide. In the multiply imputed data, a binomial regression (with a log link) for having observed CIS-R depression showed no evidence for an interaction between maternal smoking in pregnancy and CIS-R depression (RR for interaction = 0.97 (0.65, 1.44), *P* = 0.9). Thus, under an assumption that CIS-R depression is MAR given the covariates and the GP depression variables, there would again be no evidence to reject the assumption required for unbiasedness of the complete case OR estimate (as above, if the association with the chance of having observed data depended on this binary measure of depression and not on an underlying continuous measure).

#### 3.3.3 Relationship between maternal smoking in pregnancy and offspring depression

Analysis approach | Crude OR (95% CI) | Adjusted OR (95% CI) | Gain in precision (adjusted log OR) |
---|---|---|---|
Complete case (n = 2,718) | 1.72 (1.20, 2.46) | 1.36 (0.92, 2.02) | n/a |
MI (n = 14,566) | 1.86 (1.44, 2.40) | 1.46 (1.06, 2.01) | 24% |

*Abbreviations:* MI, multiple imputation; OR, odds ratio.

## 4. Discussion

## Acknowledgments

## Supplementary data

- Supplementary Material


## Article info

### Publication history

### Footnotes

Author statement: Rosie Cornish: conceptualisation, methodology, formal analysis, writing (original draft), reviewing and editing. Jonathan Bartlett: methodology, software, investigation, writing (review and editing). John Macleod: supervision, writing (review and editing). Kate Tilling: conceptualisation, methodology, supervision, writing (review and editing).

Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval and consent to participate: Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees (NHS Haydock REC: 10/H1010/70). All procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. Informed consent for the use of questionnaire and clinic data was obtained from participants following recommendations of the ALSPAC Ethics and Law Committee at the time. Study participants who complete questionnaires consent to the use of their data by approved researchers. Up until age 18 an overarching informed parental consent was used to indicate parents were happy for their child (the study participant) to take part in ALSPAC. Consent for data collection and use was implied via the written completion and return of questionnaires. Study participants have the right to withdraw their consent for specific elements of the study or from the study as a whole, at any time. At age 18, study children were sent ‘fair processing’ materials describing ALSPAC's intended use of their health and administrative records and were given clear means to consent or object via a written form. Data were not extracted for participants who objected, or who were not sent fair processing materials.

Consent for publication: Not applicable.

Availability of data and materials: Due to ALSPAC data access permissions, the authors do not have the authority to share the study data analyzed in this study, but any researcher can apply to use ALSPAC data, including the variables used in this investigation. Information about access to ALSPAC data is given on their website (http://www.bristol.ac.uk/alspac/researchers/access/). The code used to generate the simulated datasets is available from the corresponding author on reasonable request.

Competing interests: The authors declare that they have no competing interests.

Funding: This work was supported by the Medical Research Council (MR/L012081). The UK Medical Research Council and the Wellcome Trust (Grant ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. Data collection is funded from a range of sources. KT and RC work in the MRC Integrative Epidemiology Unit which receives funding from the UK Medical Research Council and the University of Bristol (MC_UU_00011/3). JB was supported by a UK Medical Research Council grant (MR/T023953/1). JM is partly funded by the National Institute for Health Research Collaboration West (NIHR ACR West) at University Hospitals Bristol and Weston NHS Foundation Trust, UK.

Authors’ contributions: RC and KT conceived and designed the study, with input from JB. JB derived the expression used to calculate the bias in the complete case estimate of the log OR. RC ran the simulations and conducted the analyses. RC, KT, and JB interpreted the results. RC wrote the first draft of the manuscript with substantial contributions from KT and JB. RC, KT, JB, and JM revised and edited the manuscript. All authors read and approved the final manuscript.

### Identification

### Copyright

### User license

Creative Commons Attribution (CC BY 4.0)

#### Permitted

- Read, print & download
- Redistribute or republish the final article
- Text & data mine
- Translate the article
- Reuse portions or extracts from the article in other works
- Sell or re-use for commercial purposes
