Brief Report | Volume 61, Issue 7, P722-727, July 2008



Reliability studies of diagnostic tests are not using enough observers for robust estimation of interobserver agreement: a simulation study



      Objective

      Any attempt to generalize the performance of a subjective diagnostic method should take into account sampling variation in both cases and readers. Most current measures of test performance, especially indices of reliability, address only the variation among cases and hence are not suitable for generalizing results across the population of readers. We studied the effect of reader variation on two measures of multireader reliability: pairwise agreement and Fleiss' kappa.

      Study Design and Setting

      We used a normal hierarchical model with a latent trait (signal) variable to simulate a binary decision-making task performed by different numbers of readers on an infinite sample of cases.
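      A minimal sketch of this kind of simulation is given here. It is not the authors' model, whose exact specification is not stated in the abstract. It assumes a standard normal latent signal per case, a normally distributed decision threshold per reader, and independent normal reading noise; the names `simulate_ratings`, `thr_sd`, and `noise_sd` are hypothetical, and a large finite number of cases stands in for the infinite sample.

```python
import numpy as np

def simulate_ratings(n_cases, n_readers, rng, thr_sd=0.5, noise_sd=0.5):
    """Binary ratings (cases x readers) from an assumed latent-signal model."""
    signal = rng.normal(0.0, 1.0, size=(n_cases, 1))           # latent trait per case
    threshold = rng.normal(0.0, thr_sd, size=(1, n_readers))   # cutoff per reader
    noise = rng.normal(0.0, noise_sd, size=(n_cases, n_readers))
    # A reader calls a case positive when the perceived signal exceeds the cutoff.
    return (signal + noise > threshold).astype(int)

def pairwise_agreement(ratings):
    """Proportion of agreeing reader pairs, averaged over cases."""
    ratings = np.asarray(ratings)
    n = ratings.shape[1]
    if n < 2:
        raise ValueError("at least two readers are required")
    pos = ratings.sum(axis=1)          # readers calling each case positive
    neg = n - pos                      # readers calling each case negative
    agreeing = pos * (pos - 1) + neg * (neg - 1)
    return float(np.mean(agreeing / (n * (n - 1))))

def fleiss_kappa(ratings):
    """Fleiss' kappa for binary ratings with the same readers on every case."""
    ratings = np.asarray(ratings)
    p_obs = pairwise_agreement(ratings)
    p_pos = ratings.mean()                       # overall share of positive calls
    p_exp = p_pos ** 2 + (1.0 - p_pos) ** 2      # chance agreement
    # Undefined (division by zero) if every rating is identical.
    return float((p_obs - p_exp) / (1.0 - p_exp))

rng = np.random.default_rng(0)
ratings = simulate_ratings(n_cases=5000, n_readers=5, rng=rng)
print(pairwise_agreement(ratings), fleiss_kappa(ratings))
```

      In this sketch the observed agreement used by Fleiss' kappa is the same quantity as the average pairwise agreement; kappa additionally corrects it for the agreement expected by chance from the overall proportion of positive calls.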


      Results

      Both measures, especially Fleiss' kappa, had a large sampling variance when estimated from a small number of readers, casting doubt on their accuracy given the number of readers typically used in current reliability studies.
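      The effect of panel size can be illustrated with a small sketch that redraws the panel of readers many times and records how much each measure varies from panel to panel. The model and all parameter values are assumptions chosen for illustration (standard normal signal, reader thresholds and reading noise with standard deviation 0.5), and `spread_over_panels` is a hypothetical helper. It does not reproduce the paper's figures.

```python
import numpy as np

def agreement_and_kappa(ratings):
    """Return (pairwise agreement, Fleiss' kappa) for binary ratings."""
    n = ratings.shape[1]
    pos = ratings.sum(axis=1)
    neg = n - pos
    p_obs = np.mean((pos * (pos - 1) + neg * (neg - 1)) / (n * (n - 1)))
    p_pos = ratings.mean()
    p_exp = p_pos ** 2 + (1.0 - p_pos) ** 2
    return p_obs, (p_obs - p_exp) / (1.0 - p_exp)

def spread_over_panels(n_readers, n_panels=200, n_cases=2000, seed=0):
    """Standard deviation of both measures across independently drawn panels."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_panels):
        # Each panel gets fresh readers (thresholds) and a fresh large case sample.
        signal = rng.normal(0.0, 1.0, size=(n_cases, 1))
        threshold = rng.normal(0.0, 0.5, size=(1, n_readers))
        noise = rng.normal(0.0, 0.5, size=(n_cases, n_readers))
        ratings = (signal + noise > threshold).astype(int)
        results.append(agreement_and_kappa(ratings))
    return np.std(np.array(results), axis=0)   # [sd of agreement, sd of kappa]

for n_readers in (2, 5, 20):
    sd_agreement, sd_kappa = spread_over_panels(n_readers)
    print(n_readers, round(float(sd_agreement), 3), round(float(sd_kappa), 3))
```

      Because the number of cases is large in every run, most of the remaining spread under these assumptions comes from which readers happened to be drawn, which is the source of variation the study is concerned with.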


      Conclusion

      Most current agreement studies are likely limited by the number of readers and are unlikely to produce a reliable estimate of reader agreement.



