Our Process · From Brief to Better Model

Precision, not promises. At every step.

A six-step pipeline built around one question: does the data we produce actually improve your model? Every step is logged, measured, and reported. Nothing is left to chance.

0.72+
Cohen's kappa guaranteed on every delivered batch
60%
Faster than pure-manual via RLAIF pre-annotation
3
Independent QA tiers on every annotation batch
14d
Post-delivery benchmark follow-up — every project
Day 1
First response to your enquiry
Day 3
Scope of Work & NDA signed
Day 7
Guidelines written & annotators calibrated
Day 8+
RLAIF + human annotation runs
Day N
Dataset + QA report + data card delivered
+14d
Model benchmark follow-up call
Why our process is different

Not a black box.
A documented pipeline.

Most annotation companies take your data, return a file, and leave. You have no visibility into how annotators were selected, whether they were calibrated, how disputes were resolved, or whether the resulting data is going to improve your model or just add noise to it.

Every step of our pipeline generates a log entry. Every batch is tracked through three tiers of quality assurance. The final delivery includes not just the annotated dataset but a QA report with inter-annotator agreement scores, gold standard pass rates, and a data card documenting exactly how the data was produced.

"The only quality metric that actually matters is whether your model improved after training on our data. Everything else — kappa scores, gold accuracy, error rates — are leading indicators. We measure all of them, and we follow up on the one that counts."

Two weeks after every delivery, we follow up and ask for your benchmark result. If the data produced the expected improvement, we document it as your case study. If it did not, we investigate and re-deliver at no cost.

Typical project — 2,000 RLHF pairs
Days 1–3: Scoping call · SOW signed · NDA executed
Days 3–7: Guidelines written · Annotators calibrated to κ ≥ 0.70
Days 7–10: Data ingested · S3 encrypted · Gold tasks injected
Days 10+: RLAIF pre-scores → human annotation running
Day N–3: 3-tier QA · Export validation · Report generation
Day N: Dataset + QA report + data card delivered ✓
+14 days: Model benchmark follow-up call scheduled
For 2,000 RLHF pairs: typically 10–16 working days from signed SOW to delivery. Timeline is included in every Scope of Work document before work begins.
Step by Step

Six steps. Zero shortcuts.

Every project runs through the same pipeline regardless of size. The steps are non-negotiable — skipping any one of them produces the data quality failures we exist to prevent.

1
Days 1–3

Discovery & Scoping

You share your data samples. We ask the questions that actually determine quality: What does your ML training pipeline need as output format? What quality threshold constitutes success? What compliance requirements apply? What is the specific failure mode your current data is producing? The Scope of Work we produce is precise — not a generic template — because every project has different requirements and different quality risks.

📋
Scope of Work
Exact deliverables, output format, volume, timeline, quality thresholds, and pricing. Signed before any data moves.
🔏
Mutual NDA
Signed before any data exchange. Covers annotators individually as well. Attorney-reviewed template.
🎯
Quality target set
We define the specific kappa threshold, gold standard accuracy, and model benchmark improvement target for your project.
💡
The most important question we ask: "How will you measure quality after you train on our data?" Your answer defines our QA targets — and the benchmark we follow up on two weeks after delivery.
2
Days 3–7

Annotation Guidelines

The annotation guidelines document is the single most important quality control mechanism in our operation. Vague guidelines produce inconsistent data regardless of how experienced your annotators are. We write 8–15 page project-specific guidelines covering: evaluation criteria with precise definitions, 20–30 worked examples including the hard cases, edge case rules with documented decisions, and explicit escalation procedures for tasks that need review.

✏️
Project-specific writing
Never a generic template reused across clients. Written for your domain, your model, your cultural context.
🧪
Tested on non-experts
We validate every guidelines document by giving it to someone with no annotation background and checking if they can produce correct annotations.
🌐
Indic cultural context
For Indian language projects: cultural nuance notes per language. How directness, politeness, and agreement norms differ across Hindi, Tamil, Telugu, Kannada.
20–30 worked examples · Edge case taxonomy · Escalation protocol · Calibration task set (25 tasks) · Sycophancy trap section (RLHF)
3
Days 5–10

Annotator Selection & Calibration

From our vetted pool, we select annotators matching three criteria: native fluency in the target language, appropriate education and experience level, and relevant domain expertise for the specific task. Then we run a calibration session — all selected annotators complete 20–30 tasks independently, then discuss every disagreement together. We do not begin live annotation until Cohen's kappa among the team is above 0.70. Anyone consistently below 0.65 on calibration is replaced.

👨‍⚕️
Domain experts, not crowds
Clinicians for healthcare. Lawyers for legal. CAs for finance. Software engineers for code RLHF. Expert judgment from day one.
📊
Calibration kappa baseline
Each annotator's kappa score on the calibration task set is documented and included in your data card. You know the baseline before annotation starts.
🔒
Individual NDAs signed
Every annotator signs a confidentiality agreement specific to your project before accessing any task. Named-access only — no anonymous crowd.
🌏
For Indic language projects: annotator selection matches language and cultural zone. A Hindi annotator from UP and one from Bihar have distinct linguistic registers — we account for this in calibration. 8 Indic languages covered with native-speaker specialist pools.
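To make the calibration gate concrete, here is a minimal sketch of the κ ≥ 0.70 check, assuming each annotator's calibration labels are collected into a simple dict in the same task order; the function and data shape are illustrative, not our production tooling, though the thresholds mirror the rule above.

```python
# Illustrative sketch only: assumed data shape, not production tooling.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def calibration_gate(labels_by_annotator, go_live=0.70, replace_below=0.65):
    """labels_by_annotator: {annotator_id: [label, ...]} over the same
    20-30 calibration tasks, in the same task order for everyone."""
    pair_kappa = {
        (a, b): cohen_kappa_score(labels_by_annotator[a], labels_by_annotator[b])
        for a, b in combinations(labels_by_annotator, 2)
    }
    # Per-annotator score: mean kappa against every other annotator
    per_annotator = {
        ann: sum(k for pair, k in pair_kappa.items() if ann in pair)
        / (len(labels_by_annotator) - 1)
        for ann in labels_by_annotator
    }
    team_kappa = sum(pair_kappa.values()) / len(pair_kappa)
    return {
        "team_kappa": team_kappa,
        "go_live": team_kappa >= go_live,       # live annotation may start
        "replace": [a for a, k in per_annotator.items() if k < replace_below],
    }
```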
4
Days 8+ (Bulk of project)

RLAIF Pre-Annotation + Expert Review

This is where our pipeline outperforms pure-human competitors. The RLAIF pre-scorer (Claude API) evaluates each task and pre-populates a suggested answer with reasoning before any annotator sees it. For image tasks, SAM2 pre-draws segmentation boundaries. Annotators then validate, correct, or override the AI suggestion — they never start from scratch. AI handles the 70–90% of tasks with clear correct answers. Expert humans handle the 10–30% requiring genuine judgment.
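For illustration, the pre-scoring call can look like the hedged sketch below, using the Anthropic Python SDK; the prompt wording, model alias, and output fields are example assumptions, not our production prompt.

```python
# Hedged sketch: prompt wording, model alias, and output fields are example
# assumptions. Assumes the model returns bare JSON in its reply.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def prescore_pair(prompt_text, response_a, response_b,
                  model="claude-3-5-sonnet-latest"):
    """Return a suggested preference with reasoning; a human annotator later
    validates, corrects, or overrides this suggestion in the labeling tool."""
    instruction = (
        "You are pre-scoring an RLHF preference pair.\n\n"
        f"Prompt: {prompt_text}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        'Reply with JSON only: {"preferred": "A" or "B", "confidence": 0.0-1.0, '
        '"reasoning": "one short paragraph"}'
    )
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": instruction}],
    )
    return json.loads(msg.content[0].text)
```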

🤖
RLAIF pre-scorer (text)
Claude API evaluates RLHF pairs and NLP tasks. Pre-scores 70–90% of tasks. Humans validate and override. 60% speed advantage over pure-manual.
🖼️
SAM2 pre-annotation (image)
Meta's Segment Anything Model pre-draws segmentation boundaries. Annotators correct boundaries and assign labels. 50% image annotation time reduction.
🎬
ByteTrack (video)
Multi-object tracking maintains consistent IDs across frames. Temporal interpolation between keyframes. 70% video annotation labor reduction.
Gold standard injection runs throughout: We inject pre-annotated test tasks at a 6% rate — randomly distributed, invisible to annotators. Their accuracy on these tasks is the real-time quality signal. Below 80% gold accuracy triggers review. Below 70% triggers reassignment.
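A minimal sketch of that mechanism, with an assumed task-queue shape (the 6% rate and the 80% / 70% thresholds are the ones stated above; everything else is illustrative):

```python
# Sketch of the gold-injection mechanics, with an assumed task-queue shape.
import random

def inject_gold(live_tasks, gold_tasks, rate=0.06, seed=None):
    """Mix ~6% pre-answered gold tasks into the queue; the gold flag stays
    server-side so annotators cannot tell which tasks are being scored."""
    rng = random.Random(seed)
    n_gold = max(1, round(len(live_tasks) * rate))
    queue = list(live_tasks) + rng.sample(list(gold_tasks), n_gold)
    rng.shuffle(queue)
    return queue

def gold_action(correct, attempted):
    """Map an annotator's running gold accuracy to the thresholds above."""
    accuracy = correct / attempted if attempted else 1.0
    if accuracy < 0.70:
        return "reassign"   # below 70%: tasks reassigned
    if accuracy < 0.80:
        return "review"     # below 80%: flagged for review
    return "ok"
```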
5
Concurrent throughout project

Three-Tier Quality Assurance

Quality assurance runs concurrently with annotation — not as a final check before delivery. Every evening, automated scripts analyse the day's annotation. Every week, a 10–15% double-annotation sample runs through peer review. The senior ML engineer audits all flagged tasks plus a 5% random sample. If kappa drops below 0.65 on any batch, annotation pauses and recalibration runs. The pipeline cannot proceed with degraded quality.

📊
Tier 1 — Automated
Nightly Python scripts: response bias detection, speed anomalies, gold standard failures, reasoning quality checks. Catches 60% of quality problems at zero human cost.
🔄
Tier 2 — Peer Review
10–15% double-annotation sample. Second annotator re-does tasks blind. Kappa recomputed. Below 0.65 triggers recalibration. Catches systematic biases.
🔬
Tier 3 — Expert Audit
ML engineer reviews all flags + 5% random sample. Catches sycophancy in RLHF data, cultural context errors, training signal quality issues.
6
Final delivery + Day 14 follow-up

Delivery & Feedback Loop

Data is exported and converted to your required format (JSONL, CoNLL, COCO, or custom). Validation scripts check every field for completeness and encoding. Delivery includes three documents: the annotated dataset, a QA report with kappa scores and gold accuracy, and a data card documenting how the data was produced. Two weeks later — the step every annotation company skips — we follow up and ask for your benchmark result.
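As an example of what that validation pass does for a JSONL delivery, here is a small sketch; the required-field list is an illustrative schema, not any specific project's export format.

```python
# Minimal sketch of a pre-delivery validation pass for JSONL.
# REQUIRED_FIELDS is an example schema, not a real project schema.
import json

REQUIRED_FIELDS = ["prompt", "chosen", "rejected", "annotator_id", "reasoning"]

def validate_jsonl(path, required=REQUIRED_FIELDS):
    """Return a list of (line_number, problem) tuples; empty means clean."""
    problems = []
    with open(path, encoding="utf-8") as f:  # raises on invalid UTF-8
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"not valid JSON: {e}"))
                continue
            for field in required:
                if field not in record or record[field] in ("", None):
                    problems.append((i, f"missing or empty field: {field}"))
    return problems
```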

JSONL (RLHF) · CoNLL-2003 (NLP) · COCO format (image) · Custom format on request · QA report PDF · Data card included · Day 14 benchmark call
The RLAIF + Human Model

How work is split on a typical RLHF project

Traditional annotation companies use humans for 100% of work. AI labeling companies use AI for 100% of work. We use each for what it is genuinely better at.

AI is better at: consistent application of clear rules at speed across large volumes. Humans are better at: nuanced judgment, cultural context, domain expertise, and catching the subtle failures that AI systems produce — sycophancy, hallucination, cultural misrepresentation.

The RLAIF pre-scorer handles the 70–90% of tasks with clear correct answers. Expert human annotators handle the 10–30% requiring genuine judgment. The QA layer runs on everything. The result: AI speed at human quality, at 35–45% lower cost than pure-human competitors.

Result: 60% faster than pure-human annotation at the same Cohen's kappa score — because humans still review every annotation before delivery. AI handles volume. Humans handle judgment. You pay only for what requires human expertise.
Work distribution — RLHF project
RLAIF pre-scorer — clear preference tasks 72%
Human experts — uncertain + edge cases 22%
3-tier QA review (concurrent) 6%
Live pipeline
📥Customer data → encrypted S3 bucket
🤖RLAIF evaluates all tasks · SAM2 pre-labels images
👤Experts validate · correct · annotate edge cases
📊3-tier QA runs concurrently throughout
📦Dataset + QA report + data card delivered
Quality Architecture

Three tiers on every batch

Most annotation companies run one quality pass. We run three independent passes throughout the project, and each tier catches failure modes the others miss.

1
Tier 1 · Automated · Nightly
Statistical anomaly detection
Python scripts run every evening on all completed annotation, flagging suspicious patterns before any human reviewer looks at the data. This tier catches approximately 60% of quality problems at zero human cost. The checks are listed below and sketched in code after the list.
Response bias — always choosing option A
Speed anomaly — 3× faster than median, not reading
Selection uniformity — variance too low across choices
Gold standard accuracy below 80% threshold
Reasoning field under 15 characters — placeholder responses
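A hedged sketch of several of those nightly checks follows; the record fields and the 90% bias cutoff are assumptions, while the 3× speed, 80% gold, and 15-character thresholds are the ones stated above.

```python
# Hedged sketch of a few Tier 1 nightly checks; record fields are assumed.
from statistics import median

def nightly_flags(records):
    """records: list of dicts with assumed keys annotator_id, choice,
    seconds_spent, is_gold, gold_correct, reasoning."""
    flags = []
    by_annotator = {}
    for r in records:
        by_annotator.setdefault(r["annotator_id"], []).append(r)

    project_median_time = median(r["seconds_spent"] for r in records)

    for ann, rows in by_annotator.items():
        choices = [r["choice"] for r in rows]
        # Response bias: one option chosen in almost every task
        if max(choices.count(c) for c in set(choices)) / len(choices) > 0.9:
            flags.append((ann, "response_bias"))
        # Speed anomaly: more than 3x faster than the project median
        if median(r["seconds_spent"] for r in rows) * 3 < project_median_time:
            flags.append((ann, "speed_anomaly"))
        # Gold standard accuracy below the 80% threshold
        gold = [r for r in rows if r["is_gold"]]
        if gold and sum(r["gold_correct"] for r in gold) / len(gold) < 0.80:
            flags.append((ann, "gold_accuracy_below_80"))
        # Placeholder reasoning: free-text justification under 15 characters
        if any(len(r["reasoning"].strip()) < 15 for r in rows):
            flags.append((ann, "reasoning_too_short"))
    return flags
```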
2
Tier 2 · Peer Review · 10–15% sample
Double-annotation + kappa recompute
A second annotator re-does 10–15% of each annotator's tasks independently, blind to the first annotator's choices. All disagreements are adjudicated. Kappa recomputed. Below 0.65 triggers recalibration.
Category-level bias — disagreement in specific task types
Guideline misinterpretation patterns across a batch
Cultural context errors on Indic language tasks
Label boundary drift on NLP entity annotation
Domain knowledge gaps in specialist annotation
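For illustration, the blind sample draw itself can be as simple as the sketch below (task IDs grouped per annotator are an assumed input shape); kappa on the re-annotated overlap is then recomputed exactly as in the calibration gate.

```python
# Sketch of the Tier 2 sample draw; the input shape is an assumption.
import random

def draw_review_sample(task_ids_by_annotator, fraction=0.12, seed=None):
    """Select a blind 10-15% re-annotation sample for each annotator."""
    rng = random.Random(seed)
    return {
        ann: rng.sample(ids, max(1, round(len(ids) * fraction)))
        for ann, ids in task_ids_by_annotator.items()
    }
```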
3
Tier 3 · Expert Audit · 5% + all flags
ML-engineer personal review
The senior ML engineer reviews every flag from Tiers 1 and 2, every gold standard failure, and a 5% random sample from across the full batch. Every decision is documented and feeds back into the guidelines for the next project.
Sycophancy in RLHF preference data — most damaging single error
Training signal quality — will this data improve the model?
Hallucination detection edge cases
Cultural nuance errors in Indic language tasks
Guideline gaps that need updating before next batch
What You Receive

Data, report, and data card

Most annotation companies deliver a data file and an invoice. We deliver three things on every project — because the QA report and data card are what turn annotated data into auditable training infrastructure.

The annotated dataset is delivered in your exact required format — JSONL for RLHF, CoNLL-2003 for NLP, COCO for images. Validation scripts check every field for completeness and encoding. If anything is malformed, it is caught before delivery, not after you try to load it into your training pipeline.

The QA report contains the actual numbers: Cohen's kappa by annotator pair and by task category, gold standard accuracy per annotator, batch error rates, number of tasks flagged and resolved, and a summary of any recalibration events that occurred during the project. These numbers are computed from Label Studio export data by automated scripts — not estimated, not reported by annotators themselves.

The data card is the nutritional label for your training data. It documents the composition of the dataset, the annotator demographics, the known limitations (if one category had lower agreement than others, we tell you which one and why), and the guidelines version used. It is what you show investors during due diligence, auditors during compliance review, and your own ML team when they ask how the training data was produced.

Two weeks after delivery, we schedule a 30-minute model benchmark follow-up call. If the data produced the expected improvement, we document it as your case study — evidence that no number we report on delivery can match.

Data Card — Project #CA-2026-047
Hindi RLHF Preference Data · Concave AI · 14 April 2026
Dataset Statistics
Preference pairs delivered: 2,000
Language: Hindi (Devanagari)
Domain split: Technical 40% · General 35% · Safety 25%
Output format: JSONL · 8 fields per record
Quality Metrics
Cohen's kappa overall: 0.74
Technical tasks kappa: 0.77
Gold standard accuracy: 91%
Flagged + resolved pairs: 47 of 2,000 (2.35%)
3-tier QA applied: Yes, all tiers
Annotator Profile
Annotator count: 8
Native Hindi speakers: 8 / 8
Pre-project kappa baseline: 0.72 (all annotators)
Known Limitations
Lower-agreement category: Abstract reasoning, κ = 0.67
Guidelines version: Hindi RLHF v2.3

The loop that keeps improving your model

Most annotation is treated as a one-time event. You pay for data, train your model, and whatever happens next is your problem. We build a continuous improvement loop into every engagement.

After three or four projects with us, our guidelines are tuned to exactly what your model needs, the annotator team knows your product context, and the data we produce is measurably better than what any new vendor could deliver on their first project.

Start the loop →
📦Data delivered + QA report sent to your team
🧠You train your model on our annotated data
📈You measure your benchmark improvement
📞Day 14 call — you share the result with us
⚙️We update guidelines based on what we find
Next batch is better than the last — every time
≥0.72
Cohen's kappa guaranteed on every batch delivered
6%
Gold standard injection rate throughout every project
60%
Speed advantage via RLAIF + SAM2 pre-annotation
8+
Indic languages with native-speaker expert annotators
14d
Post-delivery model benchmark follow-up every project

Ready to run your first project?

Begin with a free 50-output audit. We return a sycophancy susceptibility score and hallucination rate in 5 working days. No cost, no commitment, no sales call required to start.