When two annotators independently agree that Response A is better than Response B, that agreement carries statistical weight. Cohen's kappa measures genuine agreement beyond what chance alone would predict: it compares the observed agreement rate against the agreement rate the annotators' own label frequencies would produce at random. A kappa of 0.0 means two annotators agree exactly as often as random clicking would. A kappa of 1.0 means perfect agreement on every task.
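To make the definition concrete, here is a minimal sketch of the kappa computation for two annotators labelling the same preference tasks. The annotator names and labels are illustrative, not taken from any real project; the formula is the standard one, kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels on the same tasks."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of tasks where both chose the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Two annotators choosing which response is better, "A" or "B":
ann1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
ann2 = ["A", "A", "B", "A", "A", "B", "A", "B"]
print(round(cohens_kappa(ann1, ann2), 2))  # 6/8 raw agreement, kappa ≈ 0.47
```

Note the gap between raw agreement (75%) and kappa (about 0.47): with only two labels, chance agreement is high, so raw percent agreement flatters the numbers. That gap is exactly why kappa, not raw agreement, is the metric worth publishing.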
This is the gold standard for annotation quality measurement in academic NLP research and production AI systems worldwide. Almost no Indian annotation company publishes it — because publishing requires actually measuring it consistently, which requires automated QA systems most vendors simply do not have.
On every Concave AI project, you receive a kappa breakdown by annotator pair, by task category, and as an overall project average. These numbers are computed from Label Studio export data by an automated Python script — not estimated or reported by the annotators themselves. If our kappa drops below 0.70 on any batch, work pauses and we recalibrate before continuing. We have never delivered below our threshold.
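The shape of that automated check can be sketched roughly as follows. This is not the actual Concave AI script or the real Label Studio export schema; it assumes annotations have already been flattened into (task_id, annotator, label) tuples, and the 0.70 gate mirrors the threshold described above.

```python
from collections import Counter, defaultdict
from itertools import combinations

KAPPA_THRESHOLD = 0.70  # batch gate: recalibrate below this

def cohens_kappa(a, b):
    """Standard two-rater Cohen's kappa."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    p_e = sum(fa[k] * fb.get(k, 0) for k in fa) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def pairwise_kappa(records):
    """records: (task_id, annotator, label) tuples flattened from an export.
    Returns kappa for every annotator pair, on the tasks both labelled."""
    by_annotator = defaultdict(dict)
    for task, annotator, label in records:
        by_annotator[annotator][task] = label
    report = {}
    for a1, a2 in combinations(sorted(by_annotator), 2):
        shared = sorted(set(by_annotator[a1]) & set(by_annotator[a2]))
        if shared:
            report[(a1, a2)] = cohens_kappa(
                [by_annotator[a1][t] for t in shared],
                [by_annotator[a2][t] for t in shared],
            )
    return report

def batch_passes(records):
    """True only if every annotator pair clears the threshold."""
    report = pairwise_kappa(records)
    return all(k >= KAPPA_THRESHOLD for k in report.values()), report
```

Because the per-pair breakdown is preserved, a failing batch points directly at which annotator pair drifted, which is what makes recalibration targeted rather than a blanket retraining of the whole team.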
What kappa below 0.60 means for your model: inconsistent preference data teaches your reward model to fit noise rather than signal. The model learns a pattern that reflects annotator disagreement, not human preference. This is how sycophancy, reward hacking, and unstable model behaviour emerge from training pipelines — not from the training algorithm, but from the annotation data fed into it.