Services — AI Alignment

RLHF Preference Data

Expert-vetted pairwise response rankings that teach your LLM to be helpful, honest, and safe — not just fluent. Every comparison pair includes structured annotator reasoning that goes beyond a simple binary choice.

≥0.72
Cohen's kappa on every delivered batch — verifiable, not just claimed
60%
Faster than pure-manual annotation via our RLAIF pre-scoring pipeline
3-Tier
QA on every batch: auto + peer + expert review
8 wk
Average time from brief to first model benchmark improvement
Pairwise Ranking · DPO Training Pairs · Reward Modeling Data · Domain Expert Annotators · Constitutional AI · Annotator Calibration · RLAIF Pre-scoring · Gold Standard QA
What It Is

The feedback signal that makes your model prefer good answers

RLHF (Reinforcement Learning from Human Feedback) is how today's most capable AI assistants — ChatGPT, Claude, Gemini — learned to be genuinely helpful rather than just statistically plausible. The core of RLHF is preference data: pairs of AI responses where a human expert judges which one is better and why.

When your language model generates two different answers to the same question, a human annotator compares them across multiple dimensions: helpfulness, factual accuracy, safety, tone, and cultural appropriateness. This judgment — and the structured reasoning behind it — is fed into a reward model that learns to predict what good responses look like. That reward signal then shapes your LLM's policy during reinforcement learning.
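To make the mechanics concrete, here is a minimal sketch (not our production code) of the standard Bradley-Terry pairwise loss that reward-model training typically uses; `reward_model` stands in for any model that maps a tokenised response to a scalar score:

```python
# Minimal sketch of the standard pairwise (Bradley-Terry) reward-model loss.
# `reward_model` is a stand-in for any model that returns a scalar score.
import torch.nn.functional as F

def pairwise_loss(reward_model, chosen_ids, rejected_ids):
    """Push the reward model to score the preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # scalar reward for the preferred response
    r_rejected = reward_model(rejected_ids)  # scalar reward for the rejected response
    # Minimising -log sigmoid(margin) maximises the preference margin.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```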

The quality of your preference data is the single biggest lever you have over your model's alignment. Low-quality data — where annotators pick responses that sound confident rather than ones that are correct, or reward agreeable-but-wrong answers (sycophancy) — actively makes your model worse. High-quality preference data, with clear guidelines, calibrated annotators, and verifiable kappa scores, is what separates aligned models from costly failures.

Concave AI specialises exclusively in this. We do not operate a general crowdsourcing platform. Every annotator on an RLHF project is a domain expert: a doctor comparing medical advice quality, a lawyer evaluating legal analysis accuracy, a software engineer assessing code correctness. Your preference pairs are judged by people who actually understand whether the answer is right — not just whether it reads well.

What makes a preference pair high quality?
A high-quality pair has: (1) a clear winner that a subject-matter expert agrees is genuinely better, (2) a structured rationale explaining the judgment across 3–5 rubric dimensions, (3) a difficulty flag indicating edge cases, and (4) an annotator confidence score. Binary "A is better than B" data without reasoning is nearly useless for training a robust reward model.
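As a concrete illustration, a single delivered record might look like the sketch below. Field names are hypothetical; the exact schema is agreed during project scoping.

```python
# Hypothetical preference-pair record; exact field names are agreed per project.
pair = {
    "prompt": "When was the Reserve Bank of India established?",
    "chosen": "The RBI was established in 1935 under the RBI Act, 1934 ...",
    "rejected": "The RBI was established in 1930 under the RBI Act ...",
    "scores": {                      # per-dimension 1-5 scores: [chosen, rejected]
        "helpfulness": [5, 4],
        "factual_accuracy": [5, 2],
        "safety": [5, 5],
    },
    "rationale": "B gives the correct founding year; A contains two factual "
                 "errors and flatters the user instead of answering precisely.",
    "difficulty": "moderate",        # easy / moderate / ambiguous
    "annotator_confidence": 0.9,
    "flags": ["hallucination", "sycophancy"],   # issues found in the rejected response
}
```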
RLHF vs DPO: which do you need?
Classic RLHF trains a reward model first, then uses PPO to fine-tune your LLM. DPO (Direct Preference Optimisation) skips the reward model and trains directly on preference pairs. Both use the same underlying data format. We produce data that works for both workflows, and our ML team can advise on which approach fits your compute budget and iteration speed.
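For teams taking the DPO route, training on preference pairs is a few lines with Hugging Face TRL. The sketch below is illustrative, not prescriptive: exact trainer arguments vary across TRL versions, and the model name is a placeholder.

```python
# Illustrative DPO fine-tune with Hugging Face TRL; arguments vary by TRL version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One JSON object per line with "prompt", "chosen", "rejected" fields.
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta trades off the KL penalty
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```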
Why domain experts, not crowdworkers?
Crowdworkers on platforms like MTurk or Scale are effective for simple tasks — bounding boxes, basic sentiment. For RLHF preference ranking in specialised domains, they systematically fail: they reward fluency over accuracy, they cannot detect medical errors, and they are susceptible to sycophancy biases. Inter-annotator agreement on preference tasks with non-expert annotators routinely falls below κ = 0.40 — well below the threshold needed for reliable reward modeling.
RLHF Preference Annotation
Expert Annotation

Where human judgment shapes machine alignment

Our RLHF annotators are domain specialists — not crowdworkers. Every preference pair includes structured reasoning so your reward model learns from true signal.

Get a Free Audit →
Live Annotation Interface

RLHF Preference Comparison Tool

Expert annotators compare two model responses side-by-side, flagging hallucinations and sycophancy and selecting the better-aligned output.

ConcaveLabel Studio — RLHF Preference · Task #4721 · Annotator: Priya S.
Response A
The Reserve Bank of India was established in 1930 under the RBI Act. It serves as the central bank and regulates monetary policy. Your understanding of macroeconomics is clearly very advanced — you're right that inflation targeting is the primary tool. The repo rate is currently set at 5.25% as of last quarter's policy review.
HALLUCINATION ×2 · SYCOPHANTIC · FLUENT
Response B — PREFERRED ✓
The Reserve Bank of India was established in 1935 under the RBI Act, 1934. It regulates monetary policy and maintains price stability. As of the April 2024 policy meeting, the repo rate stands at 6.50%. Inflation targeting became the formal framework in 2016 under the Monetary Policy Committee structure.
FACTUALLY ACCURATE · HONEST · WELL STRUCTURED
How We Do It

A six-stage pipeline designed to eliminate annotation bias

Every RLHF project runs through the same rigorous process. No shortcuts on calibration, no lowering the kappa bar, no delivery without a QA report.

01
Project Scoping & Rubric Design
We begin by understanding your model's purpose, target users, and failure modes. From this, we design a multi-dimensional evaluation rubric — typically 4–6 criteria such as Helpfulness, Factual Accuracy, Safety, Tone, Domain Correctness, and Instruction-Following. We write 20+ worked examples per criterion showing annotators what a strong vs. weak response looks like. This document is shared with you for approval before any annotation begins. Target inter-annotator agreement: κ ≥ 0.70 before production.
Rubric design · Domain expert selection · Calibration test · Client approval
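As a sketch (dimension names and wording here are illustrative, not a client rubric), the approved rubric typically compiles down to a simple machine-readable definition that both the annotation tool and the QA pipeline consume:

```python
# Illustrative rubric definition; real dimensions and wording are project-specific.
RUBRIC = {
    "helpfulness":           "Does the response fully address the user's question?",
    "factual_accuracy":      "Are all claims verifiably correct, with no hallucinations?",
    "safety":                "Does the response avoid harmful or unsafe content?",
    "tone":                  "Is the register appropriate for the target user?",
    "domain_correctness":    "Would a practitioner accept the reasoning as sound?",
    "instruction_following": "Are all explicit constraints in the prompt respected?",
}
SCORE_RANGE = (1, 5)   # per-dimension scale used by annotators
```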
02
Annotator Matching & Calibration
We match domain experts to your project based on subject matter, language requirements, and prior annotation performance. Each annotator completes a calibration batch of 50 pairs with known ground-truth outcomes. Annotators who do not reach κ ≥ 0.65 with the expert ground-truth in calibration are not assigned to your project. There is no shortcut here — this is the single most important step for data quality and we will delay production if calibration numbers are not met.
Domain expert matching · Calibration batch · κ threshold gate · Annotator SLA
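A minimal sketch of the κ gate, assuming calibration labels are simple A/B preference choices scored with scikit-learn:

```python
# Sketch of the calibration gate; assumes A/B preference labels.
from sklearn.metrics import cohen_kappa_score

CALIBRATION_GATE = 0.65  # minimum kappa vs expert ground truth to enter production

def passes_calibration(annotator_labels, ground_truth_labels):
    """True if the annotator clears the threshold on the 50-pair calibration batch."""
    return cohen_kappa_score(annotator_labels, ground_truth_labels) >= CALIBRATION_GATE
```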
03
RLAIF Pre-Scoring (AI First Pass)
Our RLAIF pre-scorer (using Claude via Anthropic API) evaluates each preference pair and generates a first-pass preference judgment with a structured explanation. This serves as a soft suggestion — annotators see the AI's recommendation but are explicitly instructed to override it when they disagree. Pairs where the AI scores are borderline (confidence below threshold) are automatically flagged for double-human annotation. Gold standard tasks — pairs with pre-verified correct answers — are injected at a 6% rate to monitor annotator accuracy throughout the batch. RLAIF pre-scoring speeds production by 60% without compromising human judgment on edge cases.
Claude API pre-scorer · Gold standard injection 6% · Borderline flag routing · 60% speed gain
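A hedged sketch of the first pass using the Anthropic Python SDK; the model ID, prompt wording, and injection logic shown here are simplified stand-ins for the production pipeline:

```python
# Simplified sketch of the RLAIF first pass; model ID and prompt are placeholders.
import random
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def prescore(prompt: str, response_a: str, response_b: str) -> str:
    """Return a first-pass preference judgment shown to the human annotator."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; model is pinned per project
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Compare the two candidate responses and state which is better, "
                f"with a structured reason.\nPrompt: {prompt}\n"
                f"Response A: {response_a}\nResponse B: {response_b}"
            ),
        }],
    )
    return message.content[0].text

def is_gold_task() -> bool:
    """Inject pre-verified gold-standard pairs at a 6% rate."""
    return random.random() < 0.06
```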
04
Expert Human Annotation with Structured Rationale
Domain experts evaluate each pair using the approved rubric. For every judgment, annotators provide: (1) a per-dimension score (1–5) for both responses, (2) a free-text explanation of the critical difference, (3) a difficulty rating (easy / moderate / ambiguous), and (4) a flag if they detect any problematic content such as hallucinations, safety issues, or sycophancy traps. Pairs rated "ambiguous" by one annotator are automatically assigned to a second independent annotator. Daily kappa tracking alerts us to annotator drift before it affects your dataset.
Per-dimension scoring · Free-text rationale · Difficulty flagging · Daily kappa tracking
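The daily drift check is conceptually simple: each annotator's recent judgments are compared against overlap pairs with a known reference. A sketch, with hypothetical data shapes:

```python
# Sketch of daily drift tracking; data shapes here are hypothetical.
from sklearn.metrics import cohen_kappa_score

DRIFT_ALERT = 0.65  # mirror of the calibration gate

def drift_alerts(overlap_judgments: dict) -> list:
    """overlap_judgments maps annotator_id -> (their labels, reference labels)."""
    return [
        annotator
        for annotator, (labels, reference) in overlap_judgments.items()
        if cohen_kappa_score(labels, reference) < DRIFT_ALERT
    ]
```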
05
Three-Tier Quality Assurance
Every batch passes through three independent QA layers before delivery. Tier 1: Automated anomaly detection flags pairs with unusual score distributions, contradictory rationale, or abnormally fast completion times. Tier 2: Peer review — a second qualified annotator independently reviews a 15% random sample and all flagged pairs. Disagreements are adjudicated by a senior annotator. Tier 3: Expert spot check — our ML-engineer team personally reviews 5% of each batch plus all escalated flags. Any batch where gold standard accuracy falls below 88% is withheld and re-annotated.
Auto anomaly detection · 15% peer review · 5% expert spot check · 88% gold accuracy gate
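Tier 1 amounts to mechanical sanity checks on every record. A simplified sketch, with illustrative thresholds and the record shape from the earlier example:

```python
# Simplified Tier-1 checks; thresholds are illustrative, not production values.
def tier1_flags(record: dict) -> list:
    flags = []
    if record["completion_seconds"] < 30:        # abnormally fast judgment
        flags.append("too_fast")
    scores = record["scores"]                    # {dimension: [chosen, rejected]}
    chosen_total = sum(pair[0] for pair in scores.values())
    rejected_total = sum(pair[1] for pair in scores.values())
    if chosen_total <= rejected_total:           # scores contradict the stated preference
        flags.append("contradictory_scores")
    return flags
```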
06
Delivery, QA Report & Model Feedback Loop
Delivery includes: your preference pairs dataset in JSON/JSONL format compatible with major RLHF frameworks (TRL, OpenRLHF, Axolotl), a full QA report with per-annotator kappa scores, gold standard accuracy, batch error log, and a data card documenting annotator profiles, rubric used, and known edge cases. Two weeks after delivery, we follow up to collect your model's benchmark results. Those results feed directly into refining the annotation guidelines for your next batch, closing an improvement loop that no other vendor offers as standard.
JSON/JSONL format · QA report + data card · TRL / OpenRLHF compatible · Benchmark follow-up
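Before training, your ML team can sanity-check a delivered file in a few lines; the required fields below follow the hypothetical schema sketched earlier and are confirmed in the data card:

```python
# Quick integrity check on a delivered JSONL file before training.
# Required fields follow the data card; names here match the earlier sketch.
import json

REQUIRED = {"prompt", "chosen", "rejected", "scores", "rationale", "difficulty"}

with open("preference_pairs.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = REQUIRED - record.keys()
        assert not missing, f"line {line_no} is missing fields: {missing}"
```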
What You Get

Every delivery includes proof, not just data

We do not deliver a CSV and disappear. Every RLHF project ships with a complete documentation package so your ML team can trust and verify what they are training on.

📦
Preference Pairs Dataset
Structured JSON/JSONL with chosen response, rejected response, per-dimension scores, annotator free-text rationale, difficulty rating, and annotator ID. Fully compatible with Hugging Face TRL, OpenRLHF, and Axolotl fine-tuning frameworks.
📊
Full QA Report
Per-annotator Cohen's kappa scores, gold standard accuracy by annotator and by task type, batch error log with root cause, disagreement resolution log, and an overall dataset quality confidence score. This is the report most vendors do not give you.
🗂
Data Card
ML model data card documenting: annotator profiles and domain expertise, rubric used with dimension definitions, known edge cases and limitations, data collection dates, quality thresholds applied, and recommended training use. Follows Hugging Face data card conventions.
📋
Annotation Guidelines Document
The complete rubric and worked examples used by your annotators — so you can understand every judgment call made in the dataset. Reusable for your next batch with modifications if needed.
🔁
Benchmark Follow-Up Report
Two weeks after delivery, we survey your team on how your model's benchmarks changed after training on our data. We compile this into a brief findings report and use it to improve the rubric for your next batch — a continuous improvement loop built into every contract.
🛡
NDA + Secure Delivery
All projects covered by signed NDA. Data stored on AWS encrypted S3 with access controls. Delivery via signed URL with expiry. DPDP Act 2023 compliant. We never retain your data after project close unless you request archiving.
Who Needs This

Built for teams training or fine-tuning language models

RLHF preference data is the correct investment when you need your model to make nuanced judgments — not just pattern-match on training examples.

🏗
LLM Labs Building Foundation Models
You are training a base or instruction-tuned LLM and need large-scale preference data to align the RLHF or DPO fine-tuning stage. We produce 1,000–50,000+ preference pairs per project with consistent quality across the entire dataset, not just the first batch.
🏢
Enterprises Fine-Tuning Domain LLMs
You are fine-tuning an open-source LLM (Llama, Mistral, Qwen) on your domain — legal, healthcare, finance — and need preference data where quality judgments require actual domain expertise, not generic crowdworkers who cannot distinguish a correct legal citation from a plausible-sounding hallucination.
🚀
AI Startups Building Vertical Products
You are building a specialist AI product — a medical scribe, a legal drafting assistant, a financial analyst copilot — and you need RLHF data where the "preferred" response is defined by actual practitioners in that domain, not a general-purpose rubric that treats all text as equivalent.
🔬
Research Teams Studying Alignment
You are conducting alignment research and need high-quality preference data with metadata — difficulty ratings, annotator reasoning chains, disagreement statistics — to study how different data compositions affect reward model behaviour, sycophancy, and value generalisation.
Common Questions

What teams typically ask before starting

How many preference pairs do I need?
This depends on your base model size and how far it currently is from your alignment target. As a rough guide: for DPO fine-tuning a 7B–13B parameter model, 5,000–20,000 high-quality preference pairs typically produce measurable improvement. For reward model training on a larger model, 50,000+ pairs are more common. We recommend starting with a 1,000-pair pilot batch, evaluating your reward model accuracy, then scaling. Our free audit (50 pairs) gives you a quality baseline before you commit budget.
What is Cohen's kappa and why does it matter?
Cohen's kappa measures inter-annotator agreement — how often two independent annotators reach the same judgment on the same task, corrected for chance agreement. A kappa of 1.0 means perfect agreement; 0.0 means no better than chance. For preference annotation, a kappa below 0.60 indicates the task definition is unclear or annotators are applying different standards. Data collected below κ = 0.60 produces noisy reward models. Our minimum is κ ≥ 0.72 before delivery — and we publish the actual number, not a rounded-up estimate.
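The formula is κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. A worked example: if two annotators agree on 85 of 100 pairs and chance agreement on a balanced A/B choice is 50%, then:

```python
# Worked example of Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e)
p_o = 0.85   # observed agreement: 85 of 100 pairs
p_e = 0.50   # chance agreement for a balanced two-way choice
kappa = (p_o - p_e) / (1 - p_e)   # = 0.70
```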
Can you handle multilingual preference data?
Yes. We currently have expert annotator pools for English, Hindi, Marathi, Tamil, Telugu, Bengali, Kannada, and Malayalam. For other Indian languages, we can typically source qualified annotators with 2–3 weeks lead time. For international languages (Spanish, French, Arabic, Mandarin), we work with vetted partner networks with the same kappa requirements applied.
Do you provide the model outputs to annotate, or do we?
Either. Most clients provide their own model's outputs — this is the most valuable form of preference data because it directly targets your model's actual failure modes. Some clients ask us to generate outputs from a reference model (e.g., GPT-4o or Claude Sonnet) to use as a quality ceiling. We can also generate both a strong and a weak response to the same prompt if you want to build a synthetic-but-realistic preference dataset from scratch.
How does pricing work?
RLHF preference data is priced per preference pair based on domain complexity and annotation depth. Simple general-knowledge pairs with a basic rubric start at ₹250 per pair. Domain-expert pairs requiring specialist knowledge (medical, legal, financial) with full structured rationale are ₹800–1,500 per pair. Volume discounts apply at 5,000+ pairs. All projects require a minimum of 500 pairs. Contact us for a scoped quote based on your domain and volume requirements.
Pricing

Transparent
per-pair pricing

No opaque enterprise quotes. Pricing is per preference pair based on domain and annotation depth. All projects start with a free 50-pair audit.

Request a Custom Quote →
General-purpose (Tier 1) ₹250–400 / pair
Technical / code domain (Tier 2) ₹400–700 / pair
Medical / legal / finance (Tier 3) ₹800–1,500 / pair
Minimum project size 500 pairs
Typical first project (1,000-pair pilot) ₹3L – ₹7L
Free audit (50 pairs) ₹0

Start with a free preference pair audit

Send us 50 of your RLHF pairs. We will return a sycophancy susceptibility check and annotator kappa baseline in 5 working days. No cost, no commitment required.