When you fine-tune a model with SFT, you are showing it thousands of examples of the form: "Here is a question or instruction. Here is the ideal response." The model learns the style, depth, format, and reasoning pattern that characterises a good answer in your domain. This is why who writes the responses matters as much as what they write.
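Concretely, each training example is one instruction paired with one ideal response, typically stored one JSON object per line (JSONL). A minimal sketch, assuming illustrative field names ("instruction", "response") rather than any specific framework's schema:

```python
import json

# One SFT training example. Field names and the response text are
# illustrative -- the exact schema depends on your training framework.
pair = {
    "instruction": (
        "What are the differential diagnoses for bilateral "
        "lower limb oedema with raised JVP?"
    ),
    "response": (
        "Raised JVP with bilateral oedema points towards a cardiac or "
        "pericardial cause: congestive heart failure, constrictive "
        "pericarditis, cor pulmonale, or severe tricuspid regurgitation..."
    ),
}

# SFT datasets are commonly serialised as JSONL: one pair per line.
line = json.dumps(pair)
```

The model is trained to produce the response given the instruction, which is why the quality of the response text directly bounds the quality of the fine-tuned model.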
A doctor writing the ideal response to "What are the differential diagnoses for bilateral lower limb oedema with raised JVP?" will produce a fundamentally different — and far more clinically accurate — answer than a general-purpose writer who researches the same question. That difference is not stylistic. It is the difference between your model producing textbook-quality clinical reasoning and your model producing confident-sounding but clinically dangerous hallucinations.
Concave AI's SFT data service works exclusively with verified domain experts as response writers. We do not generate responses with an LLM and then have humans verify them — a practice that introduces subtle hallucinations and AI style patterns into your training data. Every response in our SFT datasets is written from scratch by a human expert, reviewed by a second expert for accuracy and completeness, and finally checked by our ML team for format adherence, length appropriateness, and instruction coverage.
SFT vs RLHF: which comes first?
SFT should come first. You fine-tune the base model on expert-written instruction-response pairs to give it a strong, well-calibrated starting point. RLHF preference data is then used to further align the model's outputs via reinforcement learning from human feedback. Skipping SFT and going straight to RLHF typically produces worse results because the base model has no consistent "good answer" baseline to improve from.
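The two stages also consume different data shapes. A minimal sketch, with illustrative field names rather than any specific framework's schema:

```python
# Stage 1 -- SFT: one instruction paired with one ideal, expert-written
# response. The model learns what a good answer looks like.
sft_example = {
    "instruction": "Summarise the key risks of this contract clause.",
    "response": "The clause exposes the buyer to uncapped liability ...",
}

# Stage 2 -- RLHF preference data: the same kind of prompt, but with two
# candidate responses ranked by a human annotator. A reward model is
# trained on these rankings and then used to fine-tune the SFT model
# with reinforcement learning.
preference_example = {
    "prompt": "Summarise the key risks of this contract clause.",
    "chosen": "The clause exposes the buyer to uncapped liability ...",
    "rejected": "This clause looks fine; there is nothing to worry about.",
}
```

The "chosen"/"rejected" ranking only teaches the model which of two answers is better; it is the SFT stage that establishes what a good answer is in the first place.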
Why not use GPT-4 or Claude to write responses?
Training on LLM-generated outputs introduces the source model's error patterns, knowledge cutoff, stylistic biases, and hallucinations into your training data. More fundamentally: if your goal is to build a model that is better than existing LLMs in your domain, training on their outputs caps your model's ceiling at their performance. Human expert writers — particularly in specialised domains — produce responses that current LLMs cannot reliably match for accuracy, nuance, and domain-specific judgment.
What makes a good SFT instruction-response pair?
A high-quality pair has: (1) a clear, unambiguous instruction that represents a realistic user need; (2) a response of appropriate length — not padded, not truncated; (3) correct factual content verified by a second domain expert; (4) appropriate formatting (headers, lists, citations) matching your target output style; and (5) no sycophancy — the response should correct factual errors in the prompt rather than agreeing with them.
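Some of these criteria can be screened automatically before human review. A minimal sketch of heuristic QC checks, assuming illustrative field names and thresholds — points (1), (2), and (5) are partially machine-checkable, while factual accuracy (3) and formatting fit (4) still need a human domain expert:

```python
def check_pair(pair, min_words=30, max_words=1200):
    """Heuristic pre-screen for one SFT instruction-response pair.

    Returns a list of issue strings for human reviewers to triage.
    Thresholds are illustrative, not recommendations.
    """
    issues = []
    instruction = pair.get("instruction", "").strip()
    response = pair.get("response", "").strip()

    # (1) Clear instruction: flag fragments that are neither a question
    # nor long enough to be a real task statement.
    if not instruction.endswith("?") and len(instruction.split()) < 3:
        issues.append("instruction may be unclear or truncated")

    # (2) Appropriate length: not truncated, not padded.
    n_words = len(response.split())
    if n_words < min_words:
        issues.append(f"response too short ({n_words} words)")
    elif n_words > max_words:
        issues.append(f"response may be padded ({n_words} words)")

    # (5) Crude sycophancy flag: agreeable openers warrant a human look.
    if response.lower().startswith(("great question", "you're right")):
        issues.append("possible sycophantic opener")

    return issues
```

A pair that passes these checks is not thereby correct — it has merely earned its place in the queue for the second expert's accuracy review.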