
Specific agreement on dichotomous outcomes can be calculated for more than two raters

  • Henrica C.W. de Vet (corresponding author; Tel.: +31-20-4446014; fax: +31-20-4446775)
    Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU Medical Center, De Boelelaan 1089A, Amsterdam 1081HV, The Netherlands
  • Rieky E. Dikmans
    Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU Medical Center, De Boelelaan 1089A, Amsterdam 1081HV, The Netherlands
  • Iris Eekhout
    Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU Medical Center, De Boelelaan 1089A, Amsterdam 1081HV, The Netherlands

      Abstract

      Objective

      For assessing interrater agreement, the concepts of observed agreement and specific agreement have been proposed. These concepts have been described for two raters and dichotomous outcomes, whereas studies often involve multiple raters. We aim to extend them to more than two raters and to examine how to calculate agreement estimates and 95% confidence intervals (CIs).

      Study Design and Setting

      As an illustration, we used a reliability study that includes the scores of four plastic surgeons classifying photographs of breasts of 50 women after breast reconstruction into “satisfied” or “not satisfied.” In a simulation study, we checked the hypothesized sample size to be used for the calculation of 95% CIs.

      Results

      For m raters, all m(m − 1)/2 pairwise 2 × 2 tables were summed. The discordant cells were then averaged before the observed and specific agreements were calculated. The total number (N) in the summed table is m(m − 1)/2 times the number of subjects (n); in the example, N = 300, compared with n = 50 subjects rated by m = 4 raters. A correction of n√(m − 1) was appropriate to obtain 95% CIs comparable to bootstrapped CIs.
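
      As a rough illustration of the procedure summarized above, the following Python sketch sums the m(m − 1)/2 pairwise 2 × 2 tables, averages the discordant cells, and computes the observed and specific agreements. The function name and the confidence-interval step are our own: the CI uses a simple proportion-type standard error with an effective sample size of n√(m − 1), based on the correction mentioned in the Results, and is an assumption rather than the authors' exact variance formula or the bootstrap comparison.

```python
import numpy as np

def pairwise_agreement(ratings, z=1.96):
    """Observed and specific agreement for m raters and dichotomous scores.

    ratings : (n_subjects, m_raters) array of 0/1 scores.
    Returns point estimates and approximate 95% CIs (sketch only).
    """
    ratings = np.asarray(ratings)
    n, m = ratings.shape

    # Sum the m(m-1)/2 pairwise 2x2 tables:
    # a = both raters score 1, d = both score 0, b/c = discordant cells.
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):
            r1, r2 = ratings[:, i], ratings[:, j]
            a += np.sum((r1 == 1) & (r2 == 1))
            b += np.sum((r1 == 1) & (r2 == 0))
            c += np.sum((r1 == 0) & (r2 == 1))
            d += np.sum((r1 == 0) & (r2 == 0))

    # Average the discordant cells before computing the agreements.
    disc = (b + c) / 2
    N = a + d + 2 * disc                      # equals n * m(m-1)/2

    observed = (a + d) / N
    pos_agreement = 2 * a / (2 * a + 2 * disc)  # specific agreement on "1"
    neg_agreement = 2 * d / (2 * d + 2 * disc)  # specific agreement on "0"

    # Approximate 95% CI with effective sample size n*sqrt(m-1)
    # (assumption based on the correction described in the abstract).
    n_eff = n * np.sqrt(m - 1)
    def ci(p):
        se = np.sqrt(p * (1 - p) / n_eff)
        return p - z * se, p + z * se

    return {
        "observed": (observed, ci(observed)),
        "positive": (pos_agreement, ci(pos_agreement)),
        "negative": (neg_agreement, ci(neg_agreement)),
    }
```

      In the breast-reconstruction example, ratings would be a 50 × 4 array of 0/1 (“not satisfied”/“satisfied”) scores, so the summed table contains N = 6 × 50 = 300 entries.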

      Conclusion

      The concept of observed agreement and specific agreement can be extended to more than two raters with a valid estimation of the 95% CIs.

