Every data annotation vendor claims high accuracy. The numbers range from 95% to 99%. They appear in pitch decks, on websites, and in contract appendices. And in most cases, they are meaningless.
The reason is straightforward. Accuracy measures whether an annotator completed a task correctly, but it does not measure whether two annotators would have made the same decision on the same task. On simple, unambiguous tasks — "is this image a cat or a dog?" — high accuracy and high agreement go together. On complex tasks — "which of these two model responses is more helpful, accurate, and safe?" — accuracy can be high while agreement is low, because multiple "correct" interpretations exist.
Inter-annotator agreement is the metric that matters for ML training data. It measures whether different annotators, working independently, reach the same conclusions on the same tasks. If they do not agree, your model is being trained on human disagreement — not human judgment. And it will reproduce that disagreement at scale in production.
Cohen's kappa is the standard metric for measuring inter-annotator agreement. Here is what it is, how to calculate it, how to interpret it, and why it should be a requirement on every annotation project you commission.
What Cohen's kappa actually measures
Cohen's kappa measures the agreement between two annotators beyond what would be expected by random chance. This distinction — "beyond chance" — is what makes it more informative than simple percentage agreement.
Consider a binary classification task where 90% of examples belong to Class A. If two annotators each assign labels at random in proportion to that distribution, without reading a single document, they will agree 82% of the time purely by chance (0.9 × 0.9 for both choosing Class A, plus 0.1 × 0.1 for both choosing Class B). Simple percentage agreement would report 82% — which sounds high but contains zero genuine annotation quality.
Kappa removes this chance component. It answers a more precise question: of the agreement that could potentially exist beyond chance, how much agreement did these annotators actually achieve?
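In formula terms, with po the observed agreement and pe the agreement expected by chance (both calculated in the worked example below):

κ = (po − pe) / (1 − pe)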
A kappa of 0.0 means the annotators agree exactly as often as chance would predict — their labels contain no genuine signal. A kappa of 1.0 means perfect agreement on every task. A kappa of 0.70 means the observed agreement closes 70% of the gap between chance-level agreement and perfect agreement — a substantial level of genuine agreement.
The interpretation scale — what each range means for your model
The standard interpretation of kappa values was established by Landis and Koch (1977) and remains the most widely used framework in annotation quality assessment. Here is what each range means in practical terms for ML training data.

Kappa range    Landis & Koch label    Practical reading for training data
0.81–1.00      Almost perfect         Near-perfect agreement
0.61–0.80      Substantial            Production quality for ML training data
0.41–0.60      Moderate               Review the annotation guidelines before scaling further
0.21–0.40      Fair                   Below threshold; recalibrate annotators
0.00–0.20      Slight                 Little genuine signal beyond chance
Below 0.00     Poor                   Worse than chance agreement
A worked example — calculating kappa step by step
Two annotators each label 100 text samples as either "positive sentiment" or "negative sentiment." Their agreement is summarised in a confusion matrix:

                          Annotator B: positive    Annotator B: negative    Total
Annotator A: positive              42                        8               50
Annotator A: negative               6                       44               50
Total                              48                       52              100

Step 1 — Observed agreement (po):
They agreed on 42 + 44 = 86 out of 100 tasks → po = 0.86
Step 2 — Expected agreement by chance (pe):
Annotator A labeled 50 positive, 50 negative. Annotator B labeled 48 positive, 52 negative.
P(both positive by chance) = (50/100) × (48/100) = 0.24
P(both negative by chance) = (50/100) × (52/100) = 0.26
pe = 0.24 + 0.26 = 0.50
Step 3 — Calculate kappa:
κ = (0.86 − 0.50) / (1 − 0.50) = 0.36 / 0.50 = 0.72
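As a quick check, the same result can be reproduced with scikit-learn by expanding the confusion matrix back into two label lists. This is a minimal sketch; the label strings are arbitrary:

from sklearn.metrics import cohen_kappa_score

# Expand the worked example's confusion matrix into per-task labels:
# 42 both positive, 8 A-positive/B-negative, 6 A-negative/B-positive, 44 both negative
ann_a = ["pos"] * 42 + ["pos"] * 8 + ["neg"] * 6 + ["neg"] * 44
ann_b = ["pos"] * 42 + ["neg"] * 8 + ["pos"] * 6 + ["neg"] * 44

print(f"{cohen_kappa_score(ann_a, ann_b):.2f}")  # prints 0.72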
Why "98% accuracy" without kappa is meaningless
A vendor who reports "98% accuracy" on your annotation project is telling you that 98% of tasks were completed — not that 98% were annotated correctly. Or they are reporting accuracy against their own internal gold standard — which they designed, which may be too easy, and which you cannot independently verify.
The fundamental issue: accuracy is a single-annotator metric. It tells you how one annotator performed against a benchmark. It does not tell you whether multiple annotators working independently would produce the same result. For ML training data, consistency across annotators is what matters — because your model learns from the aggregate pattern, not from any single annotator's decisions.
The important practical point: two annotators can both score 95% accuracy against the vendor's gold-standard set while agreeing with each other only 60% of the time on the live workload. This happens when the gold standard is easy (inflating accuracy) but the live tasks contain ambiguous edge cases the guidelines do not resolve (depressing agreement). The model trained on this data learns the 60% consistent signal and the 40% noise — and produces unpredictable behaviour on the types of inputs where the annotators disagreed.
How to calculate kappa on your own annotation data
If you are working with an annotation vendor, you should be asking for kappa on every delivery. If they cannot provide it, here is how to calculate it yourself from the raw annotations.
from sklearn.metrics import cohen_kappa_score
import json

# Load annotations from two annotators on the same tasks
with open("annotator_1.json") as f:
    ann_1 = json.load(f)  # list of labels
with open("annotator_2.json") as f:
    ann_2 = json.load(f)  # list of labels, same task order

# Calculate overall kappa
kappa = cohen_kappa_score(ann_1, ann_2)
print(f"Overall kappa: {kappa:.2f}")

# Interpretation
if kappa >= 0.80:
    print("Near-perfect agreement")
elif kappa >= 0.60:
    print("Substantial agreement — production quality")
elif kappa >= 0.40:
    print("Moderate — review annotation guidelines")
else:
    print("Below threshold — recalibrate annotators")

# For RLHF: calculate per-category kappa
# (assumes task metadata with a "category" field, stored in the same order
# as the label lists — here loaded from a hypothetical tasks.json)
with open("tasks.json") as f:
    tasks = json.load(f)

categories = set(task["category"] for task in tasks)
for cat in categories:
    cat_1 = [a for a, t in zip(ann_1, tasks) if t["category"] == cat]
    cat_2 = [a for a, t in zip(ann_2, tasks) if t["category"] == cat]
    cat_kappa = cohen_kappa_score(cat_1, cat_2)
    print(f"  {cat}: kappa = {cat_kappa:.2f}")
The per-category breakdown is particularly important for RLHF annotation. Overall kappa might be 0.73, but if "safety" queries have kappa 0.55 while "technical" queries have kappa 0.82, your model's alignment on safety topics is trained on significantly less consistent data than its technical capabilities. That inconsistency will show up in production.
What to do when kappa is too low
Low kappa is not a reason to discard annotators — it is a diagnostic signal that points to a specific fixable cause.
Review the annotation guidelines
The most common cause of low kappa is vague or ambiguous guidelines. If two annotators interpret "helpful" differently, they will produce different labels on the same task. The fix is more precise definitions with worked examples for every criterion, including the edge cases where the boundary is not obvious.
Run a calibration session
Have all annotators complete the same 25–30 tasks independently, then bring them together to discuss every disagreement. The discussion reveals where different annotators have different understandings of the criteria. Resolve these differences explicitly and update the guidelines before continuing live annotation.
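One simple way to prepare that discussion is to list every calibration task where the labels differ. A minimal sketch, using toy labels in two aligned lists (in practice these would be loaded the same way as in the earlier snippet):

cal_1 = ["helpful", "unhelpful", "helpful", "helpful"]    # annotator 1's calibration labels (toy data)
cal_2 = ["helpful", "helpful", "helpful", "unhelpful"]    # annotator 2's calibration labels (toy data)

# Pull out the tasks where the two annotators disagree, so the session can start there
disagreements = [(i, a, b) for i, (a, b) in enumerate(zip(cal_1, cal_2)) if a != b]
for i, a, b in disagreements:
    print(f"Calibration task {i}: annotator 1 said {a!r}, annotator 2 said {b!r}")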
Check for domain knowledge gaps
Low kappa on specific task categories often indicates that one or more annotators lack the domain knowledge to make consistent judgments. A generic annotator evaluating medical AI responses will have lower agreement with a medical professional than two medical professionals would have with each other. The fix is matching annotators to domains by expertise.
Examine the task design
Some tasks are genuinely ambiguous — there is no clear correct answer. If kappa remains low after guideline improvement and calibration, the task itself may need redesign. For RLHF preference ranking, this might mean adding a "both responses are equivalent" option, or splitting a multi-criterion evaluation into separate single-criterion tasks.
The questions to ask your annotation vendor
Whether you are working with an annotation vendor or managing annotation in-house, these are the kappa-related questions that should be standard on every project.
"What is the Cohen's kappa on my most recent delivered batch?" — If they cannot answer this with a specific number, they are not measuring it. This is the single most important question you can ask.
"Can you break it down by task category?" — Overall kappa masks category-level variation. Your model's weakest performance domain is almost certainly the domain with the lowest per-category kappa in the training data.
"What happens when kappa drops below your threshold?" — The answer reveals whether they have a real QA process or are just reporting post-hoc metrics. A good answer involves pausing annotation, running recalibration, and not delivering until the threshold is met again. A bad answer is "we haven't had that happen."
"Is the kappa calculated from a sample or from the full delivery?" — Kappa calculated on a 5% convenience sample can differ significantly from kappa calculated on the full dataset. Ideally, kappa is calculated on a 10–15% double-annotated sample that is representative of the full batch.
Kappa in the context of RLHF specifically
RLHF preference annotation introduces additional complexity because the task is inherently more subjective than classification. Two annotators can reasonably disagree about which model response is "better" — this is not always an error. The question is whether the disagreements are random or systematic.
Random disagreement on genuinely ambiguous pairs is expected and does not significantly harm the reward model. The noise averages out across thousands of pairs and the model learns the majority preference signal.
Systematic disagreement is the problem. When one annotator consistently favours longer responses while another favours more concise ones — or when one annotator rewards sycophantic responses while another does not — the reward model learns a blended signal that does not represent any coherent preference. This produces models that behave inconsistently in production, sometimes verbose and sometimes terse, sometimes agreeable and sometimes direct, with no clear pattern.
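One way to spot that kind of systematic split is to check how often each annotator picks the longer response. A sketch, assuming each preference record stores the chosen side and both response lengths (hypothetical fields, toy data below):

# Toy preference records: which response was preferred, plus each response's length
prefs_1 = [{"choice": "a", "len_a": 820, "len_b": 310},
           {"choice": "a", "len_a": 640, "len_b": 450},
           {"choice": "b", "len_a": 200, "len_b": 900}]
prefs_2 = [{"choice": "b", "len_a": 820, "len_b": 310},
           {"choice": "b", "len_a": 640, "len_b": 450},
           {"choice": "b", "len_a": 200, "len_b": 900}]

def longer_preference_rate(prefs):
    # Fraction of comparisons where the annotator chose the longer response
    picks = [(p["len_a"] > p["len_b"]) if p["choice"] == "a" else (p["len_b"] > p["len_a"])
             for p in prefs]
    return sum(picks) / len(picks)

print(f"Annotator 1 prefers the longer response {longer_preference_rate(prefs_1):.0%} of the time")
print(f"Annotator 2 prefers the longer response {longer_preference_rate(prefs_2):.0%} of the time")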
For RLHF annotation, kappa should be interpreted alongside a per-criterion breakdown. If annotators agree on factual accuracy (kappa 0.78) but disagree on "helpfulness" (kappa 0.52), the problem is not overall quality — it is that "helpfulness" is undefined in the guidelines. Fix the definition and kappa on that criterion will rise without changing anything else.
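A per-criterion breakdown follows the same pattern as the per-category code earlier. A sketch, assuming each annotator's preference data stores one label per criterion per comparison (hypothetical structure, toy data below):

from sklearn.metrics import cohen_kappa_score

# Toy per-criterion preference labels for three comparison pairs
rlhf_ann_1 = [{"accuracy": "a", "helpfulness": "a"},
              {"accuracy": "b", "helpfulness": "a"},
              {"accuracy": "a", "helpfulness": "b"}]
rlhf_ann_2 = [{"accuracy": "a", "helpfulness": "b"},
              {"accuracy": "b", "helpfulness": "a"},
              {"accuracy": "a", "helpfulness": "a"}]

for criterion in ["accuracy", "helpfulness"]:
    labels_1 = [pair[criterion] for pair in rlhf_ann_1]
    labels_2 = [pair[criterion] for pair in rlhf_ann_2]
    print(f"{criterion}: kappa = {cohen_kappa_score(labels_1, labels_2):.2f}")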