Every data annotation vendor claims high accuracy. The numbers range from 95% to 99%. They appear in pitch decks, on websites, and in contract appendices. And in most cases, they are meaningless.
The reason is straightforward. Accuracy measures whether an annotator completed a task correctly, but it does not measure whether two annotators would have made the same decision on the same task. On simple, unambiguous tasks — "is this image a cat or a dog?" — high accuracy and high agreement go together. On complex tasks — "which of these two model responses is more helpful, accurate, and safe?" — accuracy can be high while agreement is low, because multiple "correct" interpretations exist.
Inter-annotator agreement is the metric that matters for ML training data. It measures whether different annotators, working independently, reach the same conclusions on the same tasks. If they do not agree, your model is being trained on human disagreement — not human judgment. And it will reproduce that disagreement at scale in production.
Cohen's kappa is the standard metric for measuring inter-annotator agreement. Here is what it is, how to calculate it, how to interpret it, and why it should be a requirement on every annotation project you commission.
What Cohen's kappa actually measures
Cohen's kappa measures the agreement between two annotators beyond what would be expected by random chance. This distinction — "beyond chance" — is what makes it more informative than simple percentage agreement.
Consider a binary classification task where 90% of examples belong to Class A. If two annotators each assign labels at random in proportion to that distribution, without reading a single document, they will agree 82% of the time purely by chance (0.9 × 0.9 for both choosing Class A, plus 0.1 × 0.1 for both choosing Class B). Simple percentage agreement would report 82% — which sounds high but contains zero genuine annotation quality.
Kappa removes this chance component. It answers a more precise question: of the agreement that could potentially exist beyond chance, how much agreement did these annotators actually achieve?
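In formula terms, with po the observed agreement and pe the agreement expected by chance (both calculated in the worked example below):

κ = (po − pe) / (1 − pe)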
A kappa of 0.0 means the annotators agree exactly as often as chance would predict — their labels contain no genuine signal. A kappa of 1.0 means perfect agreement on every task. A kappa of 0.70 means the observed agreement closes 70% of the gap between chance-level agreement and perfect agreement — a substantial level of genuine agreement.
The interpretation scale — what each range means for your model
The standard interpretation of kappa values was established by Landis and Koch (1977) and remains the most widely used framework in annotation quality assessment. Here is what each range means in practical terms for ML training data.

Kappa range    Landis & Koch label    Practical reading for training data
0.81–1.00      Almost perfect         Near-perfect agreement
0.61–0.80      Substantial            Production quality for ML training data
0.41–0.60      Moderate               Review the annotation guidelines before scaling further
0.21–0.40      Fair                   Below threshold; recalibrate annotators
0.00–0.20      Slight                 Little genuine signal beyond chance
Below 0.00     Poor                   Worse than chance agreement
A worked example — calculating kappa step by step
Two annotators each label 100 text samples as either "positive sentiment" or "negative sentiment." Their agreement is summarised in a confusion matrix:

                          Annotator B: positive    Annotator B: negative    Total
Annotator A: positive              42                        8               50
Annotator A: negative               6                       44               50
Total                              48                       52              100

Step 1 — Observed agreement (po):
They agreed on 42 + 44 = 86 out of 100 tasks → po = 0.86
Step 2 — Expected agreement by chance (pe):
Annotator A labeled 50 positive, 50 negative. Annotator B labeled 48 positive, 52 negative.
P(both positive by chance) = (50/100) × (48/100) = 0.24
P(both negative by chance) = (50/100) × (52/100) = 0.26
pe = 0.24 + 0.26 = 0.50
Step 3 — Calculate kappa:
κ = (0.86 − 0.50) / (1 − 0.50) = 0.36 / 0.50 = 0.72
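As a quick check, the same result can be reproduced with scikit-learn by expanding the confusion matrix back into two label lists. This is a minimal sketch; the label strings are arbitrary:

from sklearn.metrics import cohen_kappa_score

# Expand the worked example's confusion matrix into per-task labels:
# 42 both positive, 8 A-positive/B-negative, 6 A-negative/B-positive, 44 both negative
ann_a = ["pos"] * 42 + ["pos"] * 8 + ["neg"] * 6 + ["neg"] * 44
ann_b = ["pos"] * 42 + ["neg"] * 8 + ["pos"] * 6 + ["neg"] * 44

print(f"{cohen_kappa_score(ann_a, ann_b):.2f}")  # prints 0.72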
Why "98% accuracy" without kappa is meaningless
A vendor who reports "98% accuracy" on your annotation project is telling you that 98% of tasks were completed — not that 98% were annotated correctly. Or they are reporting accuracy against their own internal gold standard — which they designed, which may be too easy, and which you cannot independently verify.
The fundamental issue: accuracy is a single-annotator metric. It tells you how one annotator performed against a benchmark. It does not tell you whether multiple annotators working independently would produce the same result. For ML training data, consistency across annotators is what matters — because your model learns from the aggregate pattern, not from any single annotator's decisions.
The important practical point: two annotators can both score 95% accuracy against the vendor's gold-standard set while agreeing with each other only 60% of the time on the live workload. This happens when the gold standard is easy (inflating accuracy) but the live tasks contain ambiguous edge cases the guidelines do not resolve (depressing agreement). The model trained on this data learns the 60% consistent signal and the 40% noise — and produces unpredictable behaviour on the types of inputs where the annotators disagreed.
How to calculate kappa on your own annotation data
If you are working with an annotation vendor, you should be asking for kappa on every delivery. If they cannot provide it, here is how to calculate it yourself from the raw annotations.
from sklearn.metrics import cohen_kappa_score
import json

# Load annotations from two annotators on the same tasks
with open("annotator_1.json") as f:
    ann_1 = json.load(f)  # list of labels
with open("annotator_2.json") as f:
    ann_2 = json.load(f)  # list of labels, same task order

# Calculate overall kappa
kappa = cohen_kappa_score(ann_1, ann_2)
print(f"Overall kappa: {kappa:.2f}")

# Interpretation
if kappa >= 0.80:
    print("Near-perfect agreement")
elif kappa >= 0.60:
    print("Substantial agreement — production quality")
elif kappa >= 0.40:
    print("Moderate — review annotation guidelines")
else:
    print("Below threshold — recalibrate annotators")

# For RLHF: calculate per-category kappa
# (assumes task metadata with a "category" field, stored in the same order
# as the label lists — here loaded from a hypothetical tasks.json)
with open("tasks.json") as f:
    tasks = json.load(f)

categories = set(task["category"] for task in tasks)
for cat in categories:
    cat_1 = [a for a, t in zip(ann_1, tasks) if t["category"] == cat]
    cat_2 = [a for a, t in zip(ann_2, tasks) if t["category"] == cat]
    cat_kappa = cohen_kappa_score(cat_1, cat_2)
    print(f"  {cat}: kappa = {cat_kappa:.2f}")
The per-category breakdown is particularly important for RLHF annotation. Overall kappa might be 0.73, but if "safety" queries have kappa 0.55 while "technical" queries have kappa 0.82, your model's alignment on safety topics is trained on significantly less consistent data than its technical capabilities. That inconsistency will show up in production.
What to do when kappa is too low
Low kappa is not a reason to discard annotators — it is a diagnostic signal that points to a specific fixable cause.
Review the annotation guidelines
The most common cause of low kappa is vague or ambiguous guidelines. If two annotators interpret "helpful" differently, they will produce different labels on the same task. The fix is more precise definitions with worked examples for every criterion, including the edge cases where the boundary is not obvious.
Run a calibration session
Have all annotators complete the same 25–30 tasks independently, then bring them together to discuss every disagreement. The discussion reveals where different annotators have different understandings of the criteria. Resolve these differences explicitly and update the guidelines before continuing live annotation.
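One simple way to prepare that discussion is to list every calibration task where the labels differ. A minimal sketch, using toy labels in two aligned lists (in practice these would be loaded the same way as in the earlier snippet):

cal_1 = ["helpful", "unhelpful", "helpful", "helpful"]    # annotator 1's calibration labels (toy data)
cal_2 = ["helpful", "helpful", "helpful", "unhelpful"]    # annotator 2's calibration labels (toy data)

# Pull out the tasks where the two annotators disagree, so the session can start there
disagreements = [(i, a, b) for i, (a, b) in enumerate(zip(cal_1, cal_2)) if a != b]
for i, a, b in disagreements:
    print(f"Calibration task {i}: annotator 1 said {a!r}, annotator 2 said {b!r}")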
Check for domain knowledge gaps
Low kappa on specific task categories often indicates that one or more annotators lack the domain knowledge to make consistent judgments. A generic annotator evaluating medical AI responses will have lower agreement with a medical professional than two medical professionals would have with each other. The fix is matching annotators to domains by expertise.
Examine the task design
Some tasks are genuinely ambiguous — there is no clear correct answer. If kappa remains low after guideline improvement and calibration, the task itself may need redesign. For RLHF preference ranking, this might mean adding a "both responses are equivalent" option, or splitting a multi-criterion evaluation into separate single-criterion tasks.
The questions to ask your annotation vendor
Whether you are working with an annotation vendor or managing annotation in-house, these are the kappa-related questions that should be standard on every project.
"What is the Cohen's kappa on my most recent delivered batch?" — If they cannot answer this with a specific number, they are not measuring it. This is the single most important question you can ask.
"Can you break it down by task category?" — Overall kappa masks category-level variation. Your model's weakest performance domain is almost certainly the domain with the lowest per-category kappa in the training data.
"What happens when kappa drops below your threshold?" — The answer reveals whether they have a real QA process or are just reporting post-hoc metrics. A good answer involves pausing annotation, running recalibration, and not delivering until the threshold is met again. A bad answer is "we haven't had that happen."
"Is the kappa calculated from a sample or from the full delivery?" — Kappa calculated on a 5% convenience sample can differ significantly from kappa calculated on the full dataset. Ideally, kappa is calculated on a 10–15% double-annotated sample that is representative of the full batch.
Kappa in the context of RLHF specifically
RLHF preference annotation introduces additional complexity because the task is inherently more subjective than classification. Two annotators can reasonably disagree about which model response is "better" — this is not always an error. The question is whether the disagreements are random or systematic.
Random disagreement on genuinely ambiguous pairs is expected and does not significantly harm the reward model. The noise averages out across thousands of pairs and the model learns the majority preference signal.
Systematic disagreement is the problem. When one annotator consistently favours longer responses while another favours more concise ones — or when one annotator rewards sycophantic responses while another does not — the reward model learns a blended signal that does not represent any coherent preference. This produces models that behave inconsistently in production, sometimes verbose and sometimes terse, sometimes agreeable and sometimes direct, with no clear pattern.
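One way to spot that kind of systematic split is to check how often each annotator picks the longer response. A sketch, assuming each preference record stores the chosen side and both response lengths (hypothetical fields, toy data below):

# Toy preference records: which response was preferred, plus each response's length
prefs_1 = [{"choice": "a", "len_a": 820, "len_b": 310},
           {"choice": "a", "len_a": 640, "len_b": 450},
           {"choice": "b", "len_a": 200, "len_b": 900}]
prefs_2 = [{"choice": "b", "len_a": 820, "len_b": 310},
           {"choice": "b", "len_a": 640, "len_b": 450},
           {"choice": "b", "len_a": 200, "len_b": 900}]

def longer_preference_rate(prefs):
    # Fraction of comparisons where the annotator chose the longer response
    picks = [(p["len_a"] > p["len_b"]) if p["choice"] == "a" else (p["len_b"] > p["len_a"])
             for p in prefs]
    return sum(picks) / len(picks)

print(f"Annotator 1 prefers the longer response {longer_preference_rate(prefs_1):.0%} of the time")
print(f"Annotator 2 prefers the longer response {longer_preference_rate(prefs_2):.0%} of the time")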
For RLHF annotation, kappa should be interpreted alongside a per-criterion breakdown. If annotators agree on factual accuracy (kappa 0.78) but disagree on "helpfulness" (kappa 0.52), the problem is not overall quality — it is that "helpfulness" is undefined in the guidelines. Fix the definition and kappa on that criterion will rise without changing anything else.
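A per-criterion breakdown follows the same pattern as the per-category code earlier. A sketch, assuming each annotator's preference data stores one label per criterion per comparison (hypothetical structure, toy data below):

from sklearn.metrics import cohen_kappa_score

# Toy per-criterion preference labels for three comparison pairs
rlhf_ann_1 = [{"accuracy": "a", "helpfulness": "a"},
              {"accuracy": "b", "helpfulness": "a"},
              {"accuracy": "a", "helpfulness": "b"}]
rlhf_ann_2 = [{"accuracy": "a", "helpfulness": "b"},
              {"accuracy": "b", "helpfulness": "a"},
              {"accuracy": "a", "helpfulness": "a"}]

for criterion in ["accuracy", "helpfulness"]:
    labels_1 = [pair[criterion] for pair in rlhf_ann_1]
    labels_2 = [pair[criterion] for pair in rlhf_ann_2]
    print(f"{criterion}: kappa = {cohen_kappa_score(labels_1, labels_2):.2f}")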