Abstract
Objective
Results of reliability and agreement studies are intended to provide information about
the amount of error inherent in any diagnosis, score, or measurement. The level of
reliability and agreement among users of scales, instruments, or classifications is
largely unknown. Therefore, there is a need for rigorously conducted interrater and
intrarater reliability and agreement studies. Information about sample selection,
study design, and statistical analysis is often incomplete. Because of inadequate
reporting, interpretation and synthesis of study results are often difficult. Widely
accepted criteria, standards, or guidelines for reporting reliability and agreement
in health care and medical research are lacking. The objective was to develop guidelines
for reporting reliability and agreement studies.
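To make concrete what such studies report, the following is a minimal sketch (not taken from the article) of two statistics commonly computed when two raters classify the same subjects on a categorical scale: overall percent agreement and Cohen's chance-corrected kappa. The ratings are hypothetical.

```python
# Minimal sketch: percent agreement and Cohen's kappa for two raters
# classifying the same subjects on a categorical scale.
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of subjects on which the two raters assign the same category."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    # Expected chance agreement from each rater's marginal category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 10 subjects on a two-category scale.
a = ["ulcer", "ulcer", "none", "none", "ulcer", "none", "ulcer", "none", "none", "ulcer"]
b = ["ulcer", "none", "none", "none", "ulcer", "none", "ulcer", "none", "ulcer", "ulcer"]

print(percent_agreement(a, b))  # 0.8
print(cohens_kappa(a, b))       # 0.6
```

The contrast between the two numbers illustrates why guidance on reporting matters: raw agreement (0.8) and chance-corrected agreement (0.6) can lead to different conclusions about the same data.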
Study Design and Setting
Eight experts in reliability and agreement investigation developed guidelines for
reporting.
Results
Fifteen issues that should be addressed when reliability and agreement are reported
are proposed. The issues correspond to the headings usually used in publications.
Conclusion
The proposed guidelines are intended to improve the quality of reporting of reliability and agreement studies.
Article info
Publication history
Published online: June 18, 2010
Accepted: March 2, 2010
Copyright
© 2011 Elsevier Inc. Published by Elsevier Inc. All rights reserved.