Abstract
Objective
Any attempt to generalize the performance of a subjective diagnostic method should account for sampling variation in both cases and readers. Most current measures of test performance, especially indices of reliability, address only case variation and hence are not suitable for generalizing results across the population of readers. We studied the effect of reader variation on two measures of multireader reliability: pair-wise agreement and Fleiss' kappa.
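For reference, with n cases, k readers, and n_{ij} the number of readers assigning case i to category j, the two measures take their standard forms (notation is ours, not necessarily that of the paper); the mean pair-wise agreement is the observed agreement \bar{P}:

\[
\bar{P} = \frac{1}{n}\sum_{i=1}^{n}\frac{\sum_{j} n_{ij}^{2} - k}{k(k-1)},
\qquad
\bar{P}_e = \sum_{j}\left(\frac{1}{nk}\sum_{i=1}^{n} n_{ij}\right)^{2},
\qquad
\kappa = \frac{\bar{P}-\bar{P}_e}{1-\bar{P}_e}.
\]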
Study Design and Setting
We used a normal hierarchical model with a latent trait (signal) variable to simulate a binary decision-making task performed by different numbers of readers on an infinite sample of cases.
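As a concrete illustration, a minimal simulation sketch of one such latent-trait model follows; the parameterization (case signals drawn from a standard normal, reader-specific decision thresholds, Gaussian perceptual noise) and every function name are our assumptions, not the specification used in the study.

import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(n_cases=10_000, n_readers=5,
                   reader_sd=0.5, noise_sd=1.0, mean_threshold=0.0):
    # Latent signal per case; each reader sees it with individual noise
    # and applies an individual threshold drawn around a common mean.
    signal = rng.normal(0.0, 1.0, size=(n_cases, 1))
    thresholds = rng.normal(mean_threshold, reader_sd, size=(1, n_readers))
    noise = rng.normal(0.0, noise_sd, size=(n_cases, n_readers))
    return (signal + noise > thresholds).astype(int)   # cases x readers, 0/1

def pairwise_agreement(ratings):
    # Mean proportion of agreeing reader pairs per case.
    _, k = ratings.shape
    pos = ratings.sum(axis=1)
    agree_pairs = pos * (pos - 1) / 2 + (k - pos) * (k - pos - 1) / 2
    return (agree_pairs / (k * (k - 1) / 2)).mean()

def fleiss_kappa(ratings):
    # Fleiss' kappa for binary ratings (cases x readers).
    _, k = ratings.shape
    pos = ratings.sum(axis=1)
    counts = np.column_stack([k - pos, pos])              # n_ij table
    p_bar = ((counts ** 2).sum(axis=1) - k).mean() / (k * (k - 1))
    p_e = ((counts.mean(axis=0) / k) ** 2).sum()          # chance agreement
    return (p_bar - p_e) / (1 - p_e)

ratings = simulate_panel(n_readers=5)
print(pairwise_agreement(ratings), fleiss_kappa(ratings))

A large number of cases stands in for the paper's infinite case sample, so that the remaining variability in the estimates comes mainly from the readers.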
Results
Both measures, especially Fleiss' kappa, have a large sampling variance when estimated from a small number of readers, casting doubt on their accuracy given the number of readers typically used in current reliability studies.
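To illustrate the point, the sketch above can be repeated over many independent reader panels to compare the spread of the estimated kappa across panel sizes (this assumes simulate_panel and fleiss_kappa as defined earlier):

# Spread of Fleiss' kappa across 200 independent reader panels per panel size.
for k in (2, 5, 10, 30):
    kappas = [fleiss_kappa(simulate_panel(n_cases=10_000, n_readers=k))
              for _ in range(200)]
    print(f"{k:2d} readers: kappa SD = {np.std(kappas):.3f}")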
Conclusion
Most current agreement studies are likely limited by the number of readers and are unlikely to produce a reliable estimate of reader agreement.
Article info
Publication history
Published online: May 20, 2008
Accepted: October 29, 2007
Copyright
© 2008 Elsevier Inc. Published by Elsevier Inc. All rights reserved.