Which Agreement Measurement Method Should I Use?

A simple guide to use specific method in a project

Posted Jan 10, 2026 Updated Jan 24, 2026

By Eric

3 min read

Which Agreement Measurement Method Should I Use?

When assessing whether two observers, instruments, or methods produce similar results, choosing the right statistical measure of agreement is crucial. This guide will help you navigate the options and select the most appropriate method for your data.

What is Agreement?

Agreement (also called concordance or reproducibility) refers to the degree of concordance between two or more sets of measurements made on the same subjects. This is fundamentally different from correlation:

Aspect	Agreement	Correlation
Purpose	Same measurements?	Related measurements?
Focus	Same variable	Different variables
Key insight	High correlation could poor agreement	Agree, must be correlated

⚠️ Common Pitfall: High correlation does not imply good agreement! Two methods can show r = 0.98 correlation while still having clinically unacceptable differences.

Decision Flowchart: Which Method Should I Use?

Methods for Categorical Data

Cohen’s Kappa (κ)

When to use: Two raters assessing binary or nominal categorical outcomes.

Cohen’s kappa accounts for agreement occurring by chance:

\[\kappa = \frac{P_o - P_e}{1 - P_e}\]

Where:

Po = Observed agreement
Pe = Expected agreement by chance

Interpretation Guide

Kappa Value	Interpretation
< 0	Less than chance agreement
0.01 – 0.20	Slight agreement
0.21 – 0.40	Fair agreement
0.41 – 0.60	Moderate agreement
0.61 – 0.80	Substantial agreement
0.81 – 0.99	Near-perfect agreement
1.00	Perfect agreement

Alternative interpretation: κ < 0.60 indicates a significant level of disagreement.

Limitations

Does not account for magnitude of differences (unsuitable for ordinal data)
Cannot be used with more than two raters
Does not differentiate between agreement for positive vs. negative findings

Weighted Kappa

When to use: Two raters assessing ordinal categorical outcomes where magnitude of disagreement matters.

Example: Microbiologists rating bacterial growth as none, occasional, moderate, or confluent. A discrepancy between “occasional” and “moderate” is less severe than between “none” and “confluent.”

Weighted kappa assigns weights to disagreements:

Linear weighting: Equal penalty per category difference
Quadratic weighting: Larger penalty for bigger differences

Fleiss’ Kappa

When to use: More than two raters assessing binary or ordinal categorical data.

This extends Cohen’s kappa to handle multiple observers rating the same subjects.

Methods for Continuous Data

Intraclass Correlation Coefficient (ICC)

When to use: You need a single summary measure of agreement between continuous measurements.

ICC estimates overall concordance by examining between-pair variance as a proportion of total variance.

ICC Value	Interpretation
< 0.50	Poor
0.50 – 0.75	Moderate
0.75 – 0.90	Good
> 0.90	Excellent

Note: There are multiple forms of ICC depending on your study design (single vs. average measures, consistency vs. absolute agreement).

Bland-Altman Plots

When to use: You want to visualize agreement and determine limits of agreement between two measurement methods.

How it works

Plot the difference between two measurements (Y-axis) against their average (X-axis)
Calculate the mean difference (bias)
Calculate 95% limits of agreement:

\[\text{Limits} = \text{Mean Difference} \pm 1.96 \times \text{SD of Differences}\]

What it tells you

Bias: Systematic difference between methods (mean line)
Precision: How scattered the differences are (SD)
Limits of Agreement: Range within which 95% of differences fall

⚠️ Important: There is no universal criterion for “acceptable” limits. This is a clinical decision based on the variable being measured.

Common Mistakes to Avoid

❌ Using Correlation Instead of Agreement

Even with r = 0.98, two methods may not be interchangeable. Correlation shows relationship; agreement shows concordance.

❌ Using Paired Tests (McNemar’s, Paired t-test)

McNemar’s test: Compares overall proportions, not individual agreement
Paired t-test: Compares mean differences; can be non-significant even when individual differences are large

❌ Using Simple Percent Agreement

Percent agreement ignores chance agreement. Two raters guessing randomly would still show ~50% agreement for binary outcomes.

Quick Reference Summary

Data Type	Raters	Recommended Method
Binary/Nominal	2	Cohen’s Kappa
Ordinal	2	Weighted Kappa
Any Categorical	>2	Fleiss’ Kappa
Continuous	2+	ICC (single index)
Continuous	2	Bland-Altman (visual + limits)

References

Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: Measures of agreement. Perspect Clin Res. 2017;8(4):187-191.

agreement evaluation

This post is licensed under CC BY 4.0 by the author.