Service — Fine-Tuning

SFT Instruction Data

Expert-written prompt and ideal-response pairs for supervised fine-tuning. Domain specialists — doctors, lawyers, engineers, educators — write the responses your model should emulate. The quality ceiling of your SFT model starts here.

10+
Expert domains covered — medical, legal, finance, engineering & more
3-Pass
Quality review on every instruction-response pair before delivery
100%
Human-expert written — no LLM-generated responses in our SFT data
8 wk
Average time from brief to measurable SFT model improvement
Instruction Tuning · Domain Expert Writers · Medical SFT · Legal SFT · Finance SFT · Multi-Turn Conversations · Chain-of-Thought · Format Adherence
What It Is

The training examples that define what good looks like

Supervised Fine-Tuning (SFT) is how you transform a pretrained language model into a domain specialist that follows instructions correctly. The quality of your SFT data defines the absolute ceiling your model can reach — no amount of RLHF or further training can compensate for low-quality instruction data.

When you fine-tune a model with SFT, you are showing it thousands of examples of the form: "Here is a question or instruction. Here is the ideal response." The model learns the style, depth, format, and reasoning pattern that characterises a good answer in your domain. This is why who writes the responses matters as much as what they write.

A doctor writing the ideal response to "What are the differential diagnoses for bilateral lower limb oedema with raised JVP?" will produce a fundamentally different — and far more clinically accurate — answer than a general-purpose writer who researches the same question. That difference is not stylistic. It is the difference between your model producing textbook-quality clinical reasoning and your model producing confident-sounding but clinically dangerous hallucinations.

Concave AI's SFT data service works exclusively with verified domain experts as response writers. We do not use LLMs to generate responses and then human-verify them — a practice that introduces subtle AI hallucinations and style patterns into your training data. Every response in our SFT datasets is written from scratch by a human expert, then reviewed by a second expert for accuracy and completeness, and then reviewed by our ML team for format adherence, length appropriateness, and instruction coverage.
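Mechanically, SFT training computes the next-token loss only on the response tokens, so the model learns to produce the expert answer rather than to reproduce the prompt. The sketch below is illustrative (not Concave AI tooling); the token ids are made up, and the `-100` ignore value follows the convention used by common fine-tuning frameworks.

```python
# Illustrative label masking for SFT: loss is computed on response tokens only.
IGNORE_INDEX = -100  # label value most frameworks skip when computing loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response token ids; mask prompt positions."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids for one instruction-response pair
inp, lab = build_labels([101, 2054, 2003], [3437, 1012, 102])
assert inp == [101, 2054, 2003, 3437, 1012, 102]
assert lab == [-100, -100, -100, 3437, 1012, 102]
```

This masking is why response quality dominates: the prompt only conditions the model, while every response token directly shapes its weights.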

SFT vs RLHF: which comes first?
SFT should come first. You first fine-tune your base model on expert-written instruction-response pairs to give it a strong, well-calibrated starting point. RLHF preference data is then used to align the model's outputs further via reinforcement learning from human feedback. Skipping SFT and going straight to RLHF typically produces worse results because the base model has no consistent "good answer" baseline to improve from.
Why not use GPT-4 or Claude to write responses?
Training on LLM-generated outputs introduces the source model's error patterns, knowledge cutoff, stylistic biases, and hallucinations into your training data. More fundamentally: if your goal is to build a model that is better than existing LLMs in your domain, training on their outputs caps your model's ceiling at their performance. Human expert writers — particularly in specialised domains — produce responses that current LLMs cannot reliably match for accuracy, nuance, and domain-specific judgment.
What makes a good SFT instruction-response pair?
A high-quality pair has: (1) a clear, unambiguous instruction that represents a realistic user need; (2) a response of appropriate length — not padded, not truncated; (3) correct factual content verified by a second domain expert; (4) appropriate formatting (headers, lists, citations) matching your target output style; and (5) no sycophancy — the response should correct factual errors in the prompt rather than agreeing with them.
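The five criteria above can be expressed as a simple review gate. This is a minimal sketch with invented field names, not our actual review schema:

```python
# Hypothetical review record for one instruction-response pair, mirroring
# the five quality checks; field names are illustrative only.
def include_in_training(review):
    checks = [
        review["instruction_clear"],         # (1) realistic, unambiguous ask
        review["length_appropriate"],        # (2) not padded, not truncated
        review["facts_verified_by_expert"],  # (3) second-expert accuracy pass
        review["format_matches_style"],      # (4) headers/lists/citations OK
        not review["sycophantic"],           # (5) corrects wrong premises
    ]
    return all(checks)

good = dict(instruction_clear=True, length_appropriate=True,
            facts_verified_by_expert=True, format_matches_style=True,
            sycophantic=False)
assert include_in_training(good)
assert not include_in_training({**good, "sycophantic": True})
```

Note the gate is conjunctive: a pair that fails any single check is excluded, because one bad dimension is enough to teach the model a bad habit.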
Instruction Data

Fine-tuning data written by real domain experts

Doctors, lawyers, engineers, and educators write the ideal responses your model should emulate. The quality ceiling of your SFT model is set right here.

Get a Free Audit →
Live Annotation Interface

SFT Instruction–Response Quality Review

Expert annotators review instruction-following pairs, tagging quality dimensions and filtering pairs that would cause model degradation if included in fine-tuning.

ConcaveLabel Studio — SFT Review · Dataset: Customer Support · 8,200 pairs
INSTRUCTION
Explain how to dispute a credit card charge on my HDFC Bank account and what documents I should keep ready for the process.
RESPONSE · Score 4.6/5
To dispute a charge, log into HDFC NetBanking → Cards → Dispute a Transaction. Select the transaction, choose the reason (Unauthorized/Duplicate/Merchant Error), and submit. Keep your statement, merchant receipt, and any email correspondence ready. Resolution typically takes 7–10 working days.
HELPFUL · ACCURATE · CLEAR
✓ INCLUDE IN TRAINING SET
INSTRUCTION
What is the best investment strategy for someone who wants to retire rich?
RESPONSE · Score 1.8/5
Invest in stocks and you will definitely become rich. The market always goes up in the long run so just buy index funds and hold them forever. You can't go wrong.
OVERCONFIDENT · NO CAVEATS · MISSING RISK DISCLOSURE
✗ EXCLUDE — TRAINING HAZARD
INSTRUCTION
Translate "The shipment has been delayed by 3 days due to customs clearance" into Hindi.
RESPONSE · Score 4.9/5
शिपमेंट कस्टम क्लीयरेंस के कारण 3 दिन देरी से आएगा। (Transliteration: Shipment customs clearance ke kaaran 3 din deri se aayega.)
ACCURATE TRANSLATION · TRANSLITERATION ADDED
✓ INCLUDE IN TRAINING SET
Expert Writers

Your model learns from actual practitioners, not researchers

We match your domain requirements to verified professional experts — not generalist writers who have read about your field. Domain credibility is verified, not assumed.

🏥
Medical & Clinical
MBBS and MD-qualified doctors across specialties: general medicine, cardiology, radiology, psychiatry, paediatrics, and surgery. All responses reviewed for clinical accuracy against standard references.
ICD coding · Clinical reasoning · Drug interactions · Diagnosis write-ups
⚖️
Legal
Qualified advocates and CS graduates with legal training specialising in Indian law (Contract Act, IPC, Companies Act, GST, SEBI) and international law. Contract drafting, judgment analysis, and compliance guidance.
Contract analysis · Legal research · Compliance · Case law
💰
Finance & BFSI
Chartered Accountants, CFAs, and banking professionals covering taxation (GST, ITR, TDS), investment analysis, risk management, financial reporting (Ind AS, IFRS), and regulatory compliance.
Tax analysis · Financial modelling · Regulatory guidance · Audit
💻
Software Engineering
Senior software engineers (5+ years) across Python, JavaScript/TypeScript, Java, Go, SQL, Rust, and C++. System design, code review, debugging, security analysis, and architecture guidance.
Code review · System design · Debugging · Security analysis
📐
Engineering & Sciences
Graduate and postgraduate engineers (civil, mechanical, electrical, chemical) and scientists (physics, chemistry, biology, mathematics). Technical explanations, problem solving, and calculation walkthroughs.
Technical problems · Calculation walkthroughs · Research explanations
📚
Education & Pedagogy
Experienced educators aligned to NCERT, CBSE, IIT-JEE, NEET, UPSC, and international curricula. Age-appropriate explanations, step-by-step problem solving, and structured concept breakdowns for EdTech AI applications.
Curriculum-aligned · Step-by-step · Multi-level explanations
The Process

Five stages from prompt design to verified delivery

01
Prompt Strategy Design
We design your instruction (prompt) dataset before any responses are written. Good SFT requires diversity across instruction types: factual questions, procedural instructions, creative tasks, analysis requests, multi-step reasoning, format-specification instructions, and domain-specific templates. We analyse your use case to determine the ideal prompt distribution — covering breadth of capability while over-representing the user needs most critical to your application. Prompt strategy is submitted for client approval before writer sourcing begins.
Instruction diversity analysis · Use-case mapping · Prompt taxonomy · Client approval
02
Expert Writer Matching & Calibration
We match domain experts to your project based on specific sub-domain requirements. A medical SFT project focused on cardiology will use cardiologists and not general practitioners. Each writer completes a calibration batch of 10 sample pairs reviewed against our quality rubric. Writers who do not meet accuracy, completeness, and format standards in calibration are not assigned to production. Every writer's institutional credentials are verified before assignment.
Credential verification · Calibration batch · Sub-domain matching · Quality gate
03
Expert Response Writing
Domain experts write responses to each prompt from scratch. They follow a style guide specifying output format, preferred citation style, length guidelines, and how to handle ambiguous or unanswerable prompts. Writers note their confidence level for each response and flag prompts that require information outside their expertise — these are reassigned to more appropriate specialists rather than filled with lower-confidence answers. Chain-of-thought responses, multi-turn conversation pairs, and tool-use examples are all supported.
Written from scratch · Confidence flagging · Chain-of-thought · Multi-turn support
04
Dual Expert Review
Every response is reviewed by a second domain expert independent of the original writer. The reviewer checks: (1) factual accuracy against domain references, (2) completeness — does the response actually address the full instruction?, (3) absence of harmful advice, sycophancy, or hallucinations, (4) format adherence, (5) appropriate scope — not over-answering simple questions or under-answering complex ones. Failed reviews trigger a rewrite, not a cosmetic edit. Our ML team spot-checks 10% of all pairs for instruction-following completeness.
Second expert review · Factual accuracy check · Sycophancy screening · ML spot-check 10%
05
Delivery & Quality Documentation
Delivery in JSONL format (ShareGPT or Alpaca-style, your preference), compatible with the Axolotl, LLaMA-Factory, and Unsloth fine-tuning frameworks. Includes full QA documentation: writer profiles and credentials, accuracy review pass rate, instruction diversity statistics (by type, domain, and complexity), and a data card. Two weeks post-training, we follow up on your model's benchmark results to inform the next batch's prompt strategy.
JSONL ShareGPT / Alpaca · Axolotl / LLaMA-Factory compatible · Data card · Benchmark follow-up
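For reference, the two delivery schemas look like this. The field names follow the widely used Alpaca and ShareGPT conventions; the sample content is invented for illustration:

```python
import json

# Alpaca-style record: one instruction, optional input context, one output.
alpaca = {
    "instruction": "Summarise the key risks in this loan agreement.",
    "input": "",  # optional context field in Alpaca-style records
    "output": "The three principal risks are ...",
}

def alpaca_to_sharegpt(rec):
    """Convert one Alpaca-style record to a ShareGPT-style conversation."""
    human = rec["instruction"] + ("\n\n" + rec["input"] if rec["input"] else "")
    return {"conversations": [
        {"from": "human", "value": human},
        {"from": "gpt", "value": rec["output"]},
    ]}

# Each dataset row is one JSON object per line (JSONL).
line = json.dumps(alpaca_to_sharegpt(alpaca), ensure_ascii=False)
assert json.loads(line)["conversations"][0]["from"] == "human"
```

ShareGPT's conversation list extends naturally to multi-turn data, which is why we default to it for conversation datasets.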
Quality Control

The three failure modes we eliminate by design

Most SFT data has three systematic problems. Our process addresses each structurally, not through post-hoc filtering.

SFT Data Quality Audit — What We Check
FAILURE MODE 1
Factual errors written with confidence. When non-expert writers research and write domain responses, they produce factually incorrect content at a rate that domain experts immediately identify but that QA processes based on fluency cannot catch. Solution: expert writers + expert reviewers only.
FAILURE MODE 2
Instruction non-compliance. The response answers a different question than the one asked, or partially addresses a multi-part instruction. Models trained on non-compliant pairs learn to produce plausible-looking but instruction-ignoring outputs. Solution: structured review that scores instruction coverage, not just response quality.
FAILURE MODE 3
Sycophantic training examples. Responses that agree with incorrect premises in the prompt teach your model to flatter users rather than correct them. This is especially dangerous in medical and legal contexts where politely agreeing with a wrong assumption can cause real harm. Solution: all responses actively challenged for sycophancy during expert review.
OUR STANDARD
Accuracy verified by a second domain expert on 100% of pairs. Instruction compliance scored by ML-engineer review on 10% sample. Zero-tolerance policy on sycophantic responses — rewrites, not edits.
Common Questions

What teams ask before starting

How many instruction-response pairs do I need for SFT?
For fine-tuning a 7B–13B parameter model on a specific domain, 2,000–10,000 high-quality pairs typically produce measurable improvement over the base model. Quality matters more than quantity — 2,000 expert-written, peer-reviewed pairs will outperform 20,000 LLM-generated pairs in domain accuracy. We recommend starting with a 500-pair pilot batch, evaluating on your benchmark, then scaling.
Can you write the prompts as well, or do we need to provide them?
We handle both. We can write prompts from scratch based on a domain brief and use-case description — this is our preferred approach because we can engineer prompt diversity and complexity distribution more effectively than a client who hasn't thought specifically about instruction coverage. Alternatively, we can write responses to prompts you supply. Hybrid approaches are common: you provide core use-case prompts, we fill in coverage gaps.
What does a "chain-of-thought" SFT pair look like?
A chain-of-thought (CoT) pair structures the response to show explicit reasoning steps before the final answer: the model learns to think out loud. For example, a medical differential diagnosis pair would not just list diagnoses — it would show the clinical reasoning process: "Given bilateral oedema + raised JVP + breathlessness, the right-heart hypothesis is most likely because... Let us evaluate congestive cardiac failure first by checking..." This format significantly improves model reasoning on complex tasks.
How is SFT instruction data priced?
Pricing depends on domain complexity, response length, and the depth of expert review required. General-purpose instruction pairs (non-specialist domains) start at ₹400–600 per pair. Domain-specialist pairs (medical, legal, financial) requiring credentialed expert writers and peer review are ₹1,200–2,500 per pair. Minimum project size is 200 pairs. Multi-turn conversation datasets are priced per turn rather than per conversation.
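A concrete example of the chain-of-thought format described above, with content invented for demonstration (it mirrors the clinical example earlier on this page): the response walks through explicit reasoning steps before committing to a final answer.

```python
# Illustrative chain-of-thought SFT pair; field names and content are
# examples only, not a delivered record.
cot_pair = {
    "instruction": ("A patient has bilateral leg oedema, raised JVP and "
                    "breathlessness. What is the most likely diagnosis?"),
    "response": (
        "Step 1: Bilateral oedema with raised JVP points to a systemic, "
        "right-heart cause rather than a local venous problem.\n"
        "Step 2: Breathlessness alongside these signs suggests pulmonary "
        "congestion from left-heart involvement.\n"
        "Step 3: Taken together, congestive cardiac failure best explains "
        "all three findings.\n"
        "Answer: Congestive cardiac failure is the most likely diagnosis."
    ),
}
assert "Answer:" in cot_pair["response"]
```

Training on pairs structured this way teaches the model to expose its reasoning before the conclusion, rather than asserting the conclusion alone.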
Pricing

Expert-quality
SFT pairs

Priced per instruction-response pair based on domain and expert level required. Every pair includes dual expert review — no exceptions.

Request a Project Scope →
General-purpose (non-specialist) · ₹400–600 / pair
Technical / engineering / code · ₹700–1,200 / pair
Medical / legal / financial · ₹1,200–2,500 / pair
Multi-turn conversation (per turn) · ₹300–800 / turn
Chain-of-thought format (premium) · +40% on base rate
Minimum project size · 200 pairs
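The rate card translates into a simple back-of-envelope estimate. The sketch below uses the midpoint of each published band; an actual quote depends on scope review.

```python
# Back-of-envelope project cost from the rate card, midpoint of each band.
RATE_MIDPOINT = {                    # ₹ per pair (per turn for multi-turn)
    "general":    (400 + 600) / 2,   # non-specialist domains
    "technical":  (700 + 1200) / 2,  # engineering / code
    "specialist": (1200 + 2500) / 2, # medical / legal / financial
}
COT_PREMIUM = 0.40  # +40% on base rate for chain-of-thought format

def estimate(tier, n_pairs, chain_of_thought=False):
    """Rough project cost in ₹ for n_pairs at the given tier."""
    rate = RATE_MIDPOINT[tier]
    if chain_of_thought:
        rate *= 1 + COT_PREMIUM
    return rate * n_pairs

# e.g. 500 specialist chain-of-thought pairs at midpoint rates:
cost = estimate("specialist", 500, chain_of_thought=True)
assert round(cost) == 1_295_000  # ≈ ₹12.95 lakh
```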

Request a free SFT sample batch

Tell us your domain and use case. We will write 10 sample instruction-response pairs with our expert team and deliver them for your review — no cost, no commitment.