Quality Standards

Numbers, not claims.
Verify everything yourself.

Every competitor claims "98% accuracy." We publish Cohen's kappa scores, gold standard pass rates, and batch error logs on every single delivery. Here is exactly what those numbers mean, how we guarantee them, and what happens if we fall below threshold.

≥ 0.72
Cohen's kappa inter-annotator agreement guaranteed on every delivered batch
≥ 88%
Gold standard task accuracy across all annotators on each project
3
Independent QA tiers applied to every annotation batch before delivery
100%
Of deliveries include a full QA report and data card — standard, always
Cohen's Kappa ≥ 0.72 · 3-Tier QA · Gold Standard Injection · RLAIF Pre-scoring · Sycophancy Detection · Data Card Delivery · AES-256 Encryption · DPDP Act 2023 Compliant
Inter-Annotator Agreement

Why Cohen's kappa is the only honest quality metric

When two annotators independently agree that Response A is better than Response B, that agreement carries statistical weight. Cohen's kappa measures genuine agreement beyond what random chance alone would predict. A kappa of 0.0 means two annotators agree exactly as often as random clicking would; a kappa of 1.0 means perfect agreement on every task.
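For the curious, here is a minimal sketch of the computation, assuming two annotators' choices arrive as parallel Python lists. It is not our production script (which works on Label Studio exports); it simply makes the formula κ = (p_o - p_e) / (1 - p_e) explicit:

```python
# Minimal sketch of Cohen's kappa for two annotators (illustrative only;
# parsing of real Label Studio exports is omitted).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # p_o: observed agreement, the fraction of tasks where both chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement, computed from each annotator's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    # kappa = (p_o - p_e) / (1 - p_e): agreement beyond what chance predicts
    return (p_o - p_e) / (1 - p_e)

# Two annotators ranking the same four preference pairs:
print(cohens_kappa(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.5
```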

This is the gold standard for annotation quality measurement in academic NLP research and production AI systems worldwide. Almost no Indian annotation company publishes it — because publishing requires actually measuring it consistently, which requires automated QA systems most vendors simply do not have.

"When a vendor says '98% accuracy' without specifying their inter-annotator agreement methodology, they are telling you the number they want you to believe, not the number they measured."

On every Concave AI project, you receive a kappa breakdown by annotator pair, by task category, and as an overall project average. These numbers are computed from Label Studio export data by an automated Python script — not estimated or reported by the annotators themselves. If our kappa drops below 0.70 on any batch, work pauses and we recalibrate before continuing. We have never delivered below our threshold.

What kappa below 0.60 means for your model: inconsistent preference data teaches your reward model to fit noise rather than signal. The model learns a pattern that reflects annotator disagreement, not human preference. This is how sycophancy, reward hacking, and unstable model behaviour emerge from training pipelines — not from the training algorithm, but from the annotation data fed into it.

< 0.20
Slight — effectively random
Annotators agree no more than chance would predict. Your model learns noise disguised as signal. Produced by untrained crowds on ambiguous tasks with no guidelines. Useless for RLHF.
0.20–0.40
Fair — poor guideline compliance
Significant disagreement remains. Caused by vague annotation guidelines, undertrained annotators, or domain mismatches. Inadequate for RLHF or GenAI alignment data.
0.40–0.60
Moderate — risky for alignment
Acceptable for simple binary classification. Risky for RLHF preference ranking, where inconsistency teaches reward models to fit human disagreement rather than human preference. Many Indian vendors operate here.
0.60–0.80
★ Substantial — our minimum threshold (we guarantee ≥ 0.72)
Production-grade annotation quality. The standard used by academic NLP research and recommended for RLHF, SFT, and GenAI evaluation data. Typical Concave AI deliveries: 0.72–0.79. Below 0.70 triggers recalibration.
0.80–1.00
Near-perfect — treat with skepticism
Achievable only on simple, unambiguous, well-defined tasks. Any vendor claiming 0.85+ kappa on complex judgment tasks like RLHF preference ranking is almost certainly aggregating across easy tasks to inflate the headline number. Ask for the per-category breakdown.
Quality Standards

Metrics you can verify yourself

Every delivery ships with a QA report: Cohen's kappa scores, gold standard accuracy per annotator, batch error rates, and a data card. No competitor does this as standard.

Get a Free Audit →
QA Architecture

Three tiers, run on every single batch

Most annotation companies run one quality pass at the end. We run three, independently, throughout the project. Each tier is designed to catch failure modes the others would miss, at the stage where intervention is cheapest.

Tier 1 — Automated · Runs nightly
Statistical anomaly detection
Python scripts run every evening on all annotation completed that day. They check every annotator's behavioural patterns against five statistical rules and flag violations before any human reviewer looks at the data. Catches approximately 60% of quality problems at zero human cost and in time to fix them before the next working day. A sketch of these checks follows the list below.
Response bias: Annotator chooses option A more than 68% of the time — suggests position bias, not genuine preference evaluation
Speed anomaly: Annotator completes tasks in less than 30% of the median time — suggests skimming without reading
Uniformity anomaly: Response variance too low across a batch — suggests mechanical clicking without evaluation
Gold standard failure: Gold task accuracy below 80% — direct evidence of quality drift
Reasoning quality: Required reasoning text field under 15 characters — evidence of placeholder responses
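A simplified sketch of the five checks above, assuming each annotator's day of work arrives as a list of task records. The thresholds come from the rules listed; the record fields and the variance cutoff are illustrative assumptions, not our production values:

```python
# Simplified nightly anomaly checks (illustrative; record fields hypothetical).
from statistics import median, pvariance

def nightly_flags(tasks, batch_median_seconds):
    flags = []
    n = len(tasks)
    # Response bias: option A chosen more than 68% of the time
    if sum(t["choice"] == "A" for t in tasks) / n > 0.68:
        flags.append("response_bias")
    # Speed anomaly: typical completion time under 30% of the batch median
    if median(t["seconds"] for t in tasks) < 0.30 * batch_median_seconds:
        flags.append("speed_anomaly")
    # Uniformity anomaly: near-zero score variance suggests mechanical clicking
    if pvariance([t["score"] for t in tasks]) < 0.05:  # cutoff is an assumption
        flags.append("uniformity_anomaly")
    # Gold standard failure: accuracy on injected gold tasks below 80%
    gold = [t for t in tasks if t.get("is_gold")]
    if gold and sum(t["gold_correct"] for t in gold) / len(gold) < 0.80:
        flags.append("gold_failure")
    # Reasoning quality: required free-text rationale under 15 characters
    if any(len(t["reasoning"].strip()) < 15 for t in tasks):
        flags.append("reasoning_too_short")
    return flags
```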
Tier 2 — Peer Review · 10–15% sample
Double-annotation + kappa recompute
A second annotator redoes 10–15% of each annotator's completed tasks independently, blind to the first annotator's choices. All disagreements are logged and adjudicated by the project lead. Cohen's kappa is recomputed after each batch completes. If kappa falls below 0.70 on any batch, annotation pauses and a recalibration session runs before work continues. A sampling sketch follows the list below.
Category-level bias: An annotator consistently disagrees on "safety" tasks but agrees on "helpfulness" — domain knowledge gap, not general quality failure
Guideline misinterpretation: Two annotators have different understandings of a criterion — triggers guideline clarification, not annotator removal
Cultural context errors: Indic language tasks where annotators apply inappropriate Western communication norms
Label boundary drift: For NLP tasks, entity span boundaries gradually widening or narrowing across a long batch
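A minimal sketch of drawing the blind review sample, assuming completed task IDs per annotator (the helper and the default rate are illustrative):

```python
# Illustrative blind re-annotation sampling for Tier 2 (rate from the text).
import random

def draw_review_sample(task_ids, rate=0.12, seed=None):
    rng = random.Random(seed)
    k = max(1, round(rate * len(task_ids)))
    # The reviewer receives only the raw tasks, never the first annotator's
    # choices, so the second pass stays blind.
    return rng.sample(task_ids, k)

print(draw_review_sample(list(range(100)), rate=0.10, seed=7))  # 10 task IDs
```

Kappa on the overlapping sample is then recomputed with the same calculation shown earlier.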
Tier 3 — Expert Audit · 5% + all flags
ML-engineer personal review
Our senior ML engineer personally reviews every task flagged by Tiers 1 and 2, every gold standard failure regardless of the overall pass rate, and a random 5% sample drawn from across the full batch. For Indic language tasks, cultural context is verified by a native speaker with relevant domain knowledge. Every expert decision is documented with full reasoning.
Sycophancy in RLHF data: Preference pairs where the "winning" response validates a false premise — the single most damaging class of annotation error
Training signal quality: Assessing whether the data will produce the desired model behaviour — not just whether it is correctly labeled
Hallucination detection edge cases: Claims that are technically accurate but misleadingly incomplete
Guideline gaps: Recurring edge cases the guidelines don't cover — triggers a guidelines update before the next batch begins
Gold Standard Tasks

Your real-time quality signal, running continuously

Gold standard tasks are the most reliable quality monitoring mechanism in professional annotation. They catch quality drift within the first 20–30 tasks — not after 2,000 have been submitted.

Before any live project begins, we annotate 50–100 tasks ourselves with high confidence — tasks where the correct answer is unambiguous and the reasoning is well-documented. These become our test questions. We inject them at a 6% rate, distributed randomly throughout every annotation batch.

Annotators do not know which tasks are gold standard. They annotate everything identically. Because the gold tasks are visually indistinguishable from live tasks, an annotator who would perform well on obvious test questions but poorly on real work cannot game this system.

After each batch completes, an automated script checks every annotator's gold task answers. Below 80% accuracy — the annotator is flagged and their recent work is reviewed. Below 70% accuracy — tasks are reassigned and the annotator is removed from the project. This happens before any human QA reviewer sees the data.
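A sketch of the injection and the automated verdict, using the 6%, 80%, and 70% figures from above; the queue handling itself is illustrative:

```python
# Illustrative gold-task injection and post-batch verdict.
import random

def inject_gold(live_tasks, gold_pool, rate=0.06, seed=None):
    rng = random.Random(seed)
    queue = list(live_tasks)
    n_gold = max(1, round(rate * len(queue)))
    # Gold tasks land at random positions and carry no visible marker,
    # so annotators cannot tell them apart from live work.
    for gold_task in rng.sample(gold_pool, n_gold):
        queue.insert(rng.randrange(len(queue) + 1), gold_task)
    return queue

def gold_verdict(gold_accuracy):
    # Thresholds from the text: below 70% removes the annotator from the
    # project; below 80% flags them and triggers review of recent work.
    if gold_accuracy < 0.70:
        return "remove_and_reassign"
    if gold_accuracy < 0.80:
        return "flag_and_review"
    return "ok"
```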

6%
Gold injection rate across every batch — randomised, not clustered
80%
Minimum gold accuracy required to continue — below this triggers review
50+
Gold tasks in our library per project type — rotated to prevent memorisation
Auto
Evaluated by script — not manually, removing human judgement from the check
Annotator's task queue — gold injected at positions 004, 008, 011...
001 · RLHF pair: Python data processing query [Live]
002 · RLHF pair: Medical symptom information [Live]
003 · RLHF pair: Creative writing feedback [Live]
004 · RLHF pair: Legal question — sycophancy trap [Gold ★]
005 · RLHF pair: Financial calculation explanation [Live]
006 · RLHF pair: Code debugging walkthrough [Live]
007 · RLHF pair: Science concept explanation [Live]
008 · NLP entity task: Business news paragraph [Gold ★]
009 · RLHF pair: Customer service scenario [Live]
010 · RLHF pair: Ethical dilemma response [Live]
011 · RLHF pair: Hallucination detection task [Gold ★]
The Full Pipeline

Quality enforced at every step, not just the end

Quality is not a final check before delivery. It is enforced at every step of the pipeline — from the moment customer data arrives on our servers to the model benchmark follow-up two weeks after delivery. Each step in the pipeline generates a log entry included in your delivery documentation.

The RLAIF pre-scorer (Claude API) reduces human error by flagging uncertain cases for extra human scrutiny — the cases that would most likely produce annotator disagreement are the ones that get the most careful human attention. Gold standard injection catches quality drift in real time. Three-tier QA runs concurrently with annotation so problems are caught mid-project, not post-delivery.
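As a rough sketch of how such a pre-scorer can be wired up with the Anthropic Python SDK. The prompt, model name, and routing rule here are assumptions for illustration, not our production configuration:

```python
# Illustrative RLAIF pre-scoring: clear-cut pairs get a pre-label, anything
# uncertain is routed to the human queue. Not our production prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def prescore(user_prompt, response_a, response_b):
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # model name is illustrative
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                "Which response better answers the user? Reply with exactly "
                "one word: A, B, or UNSURE.\n\n"
                f"User: {user_prompt}\n\n"
                f"Response A: {response_a}\n\nResponse B: {response_b}"
            ),
        }],
    )
    verdict = msg.content[0].text.strip().upper()
    # UNSURE (or anything unexpected) goes to the human queue with priority.
    return verdict if verdict in {"A", "B"} else "NEEDS_HUMAN"
```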

The model benchmark follow-up on Day 14 is the most important quality step — and the one most annotation companies skip entirely. It is the only quality signal that measures what actually matters: did the data improve the model? Everything else — kappa scores, gold accuracy, batch error rates — is a leading indicator. Model improvement is the actual outcome you paid for.

Project quality pipeline — every step logged
📥 Customer data to isolated, encrypted S3 bucket [AES-256]
📝 Project-specific guidelines written + annotators calibrated to κ ≥ 0.70 [Calibrated]
🥇 Gold standard tasks injected at 6% rate — randomised positions [Gold In]
🤖 RLAIF pre-scorer evaluates all tasks — clear cases pre-labeled [AI Pre-label]
👤 Domain-expert annotators validate, correct, and annotate edge cases [Human]
📊 Tier 1: Automated anomaly detection runs nightly on all batches [Auto-QA]
🔄 Tier 2: Peer double-annotation on 10–15% sample · kappa recomputed [Peer QA]
🔍 Tier 3: ML-engineer expert audit — 5% random + all flags reviewed [Expert QA]
📦 Dataset + QA Report + Data Card delivered via encrypted transfer [Delivered]
📈 Day 14 model benchmark follow-up · result used to improve next batch [Feedback Loop]
The Data Card

Every delivery comes with a full data card

A data card is a structured metadata document that describes the dataset you receive — its composition, the quality metrics achieved, the annotator demographics, the known limitations, and the guidelines version used. Think of it as the nutritional label for your training data.

Most annotation providers deliver a data file and an invoice. We deliver a data file, a QA report, and a data card — on every project, as standard. The data card is what your ML team uses to understand exactly what they trained on and why the model behaves the way it does.

It is also what you show investors during due diligence, enterprise customers during procurement, and auditors when they ask how your AI system was trained. The "Known Limitations" section is deliberately honest — if one task category had lower agreement than others, we tell you exactly which one and why. Hiding limitations helps no one; disclosing them lets you factor them into your training strategy.

Data Card — Project #CA-2026-047
Hindi RLHF Preference Data · Concave AI · Delivered 14 April 2026
Dataset Statistics
Total preference pairs: 2,000
Language: Hindi — Devanagari script
Domain distribution: Technical 40% · General 35% · Safety 25%
Reasoning field included: Yes — all 2,000 pairs
Quality Metrics
Cohen's kappa — overall: 0.74
Cohen's kappa — technical tasks: 0.77
Cohen's kappa — safety tasks: 0.71
Gold standard accuracy: 91%
Flagged and resolved pairs: 47 of 2,000 (2.35%)
3-tier QA applied: All tiers · full documentation
Annotator Profile
Annotators on project: 8
Native Hindi speakers: 8 / 8
Calibration kappa (pre-project): 0.72 — all 8 annotators
Known Limitations
Lower agreement category: Abstract reasoning — κ = 0.67
Reason: Evaluating reasoning quality is inherently more subjective — disclosed, not hidden
Guidelines version: Hindi RLHF v2.3 · 14 Apr 2026
Sycophancy Detection

The quality failure nobody measures — until the model is in production

Sycophancy is what happens when annotators — often without realising it — reward responses that agree with the user over responses that are accurate. The model learns this pattern from the preference data and produces outputs that validate beliefs rather than provide truth.

This is a data-level problem. By the time it appears in production model behaviour, the damage is already in the training data — and no amount of post-training adjustment will fully correct it. The only fix is to measure it at the annotation stage and correct it before training begins.

We inject sycophancy trap tasks — cases where the factually correct response disagrees with the user's premise — at a 5–8% rate into every RLHF project. We measure what percentage of annotators choose the agreeable-but-wrong response. We report that number in your QA report. And we weight sycophancy-trap tasks more heavily in the training data to counteract the bias.

Industry benchmark: In uncontrolled annotation environments, 30–40% of annotators fail sycophancy trap tasks. After calibration and sycophancy-specific guideline training, our projects typically reduce this to 8–12%.
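Scoring the failure rate itself is simple once the traps are tagged; a minimal sketch, with hypothetical field names:

```python
# Illustrative sycophancy-trap scoring (field names are hypothetical).
def sycophancy_failure_rate(trap_results):
    # A failure means preferring the agreeable-but-wrong response over
    # the factually correct one.
    failures = sum(r["chosen"] != r["correct"] for r in trap_results)
    return failures / len(trap_results)

traps = [
    {"chosen": "B", "correct": "B"},  # resisted the trap
    {"chosen": "A", "correct": "B"},  # validated the false premise
    {"chosen": "B", "correct": "B"},
]
print(f"{sycophancy_failure_rate(traps):.0%}")  # 33%
```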

Sycophancy trap example — RLHF pair
User asks: "Python is slower than Java for all data processing tasks, right? I've heard this from multiple engineers."
Response A — agreeable, factually wrong: "You're correct — Java consistently outperforms Python in data processing, especially at enterprise scale. This is well-established in the industry."
Response B — accurate, mildly disagrees with user: "This is a common belief but not accurate as a blanket statement. Python with NumPy and Pandas typically outperforms Java for scientific and analytical data tasks. Java may be faster for some concurrent processing scenarios, but the premise as stated is incorrect."
The correct preference is Response B. An annotator who chooses A — because it "sounds more certain" or "agrees with the user" — is teaching the model to validate incorrect beliefs. This is how sycophancy enters a training pipeline.
34%
Industry average: annotators who fail sycophancy traps in uncalibrated environments
9%
Concave AI average after calibration + sycophancy-specific guideline training
Compliance & Security

Data security is not a feature — it's the minimum

Every project begins with a mutual NDA. Every annotator signs a confidentiality agreement. Customer data is encrypted at rest and in transit. No anonymous crowd access, ever.

🇮🇳
DPDP Act 2023
Full compliance with India's Digital Personal Data Protection Act. Data handling, storage controls, access logging, deletion policies, and consent frameworks aligned with Indian law. The most important compliance standard for any Indian AI company working with user data.
🔐
AWS Encrypted Storage
All customer data stored in isolated S3 buckets — one bucket per customer, never shared. AES-256 server-side encryption enabled on all objects. File access via time-limited signed URLs with 24-hour expiry. No data transmitted unencrypted at any point.
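For illustration, a time-limited signed URL on an encrypted object can be generated with boto3 roughly like this; the bucket and key names are hypothetical:

```python
# Illustrative 24-hour presigned URL for a customer's isolated bucket.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "customer-047-isolated", "Key": "delivery/batch_01.jsonl"},
    ExpiresIn=24 * 60 * 60,  # expires after 24 hours, per the access policy
)
```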
📋
Signed NDA — Every Project
Mutual NDA signed before any data exchange. Individual annotators sign separate confidentiality agreements before accessing any project. Only named individuals listed in the project data card have access — no anonymous crowd access, no shared credentials.
🇪🇺
GDPR Ready
For international customers with European data subjects. Data minimisation, consent tracking, right-to-erasure workflows, and lawful basis documentation available on request. Data Processing Agreement provided for all enterprise engagements.
🏥
HIPAA-Aligned
For healthcare AI customers handling protected health information. Secure de-identification workflows, per-session access logs, and Business Associate Agreement available for medical imaging annotation and clinical NLP projects.
📊
ISO 27001 In Progress
All current security controls align with ISO 27001 information security management requirements. Formal certification process underway — expected Q3 2026. Current security posture documentation available to enterprise customers on request.

See our standards on your own data

We evaluate 50 of your model outputs or RLHF pairs and return a kappa baseline, sycophancy susceptibility score, and findings report in 5 working days. No cost. No commitment. No sales call required to start.