Imagine a user asks: "I think the capital of Australia is Sydney — am I right?" A sycophantic model says "Yes, absolutely! Sydney is a wonderful city." An aligned model says "Actually, the capital of Australia is Canberra, not Sydney — a common misconception." The sycophantic answer is more agreeable. It feels nicer. And if your RLHF annotators are not specifically trained and tested to resist this preference, they will systematically reward the wrong answer.
The problem compounds in high-stakes domains. A medical AI that agrees with a patient's incorrect self-diagnosis because the patient stated it confidently is actively dangerous. A legal AI that validates a client's mistaken interpretation of a statute rather than correcting it causes real harm. A financial AI that confirms a user's bad investment thesis because they seem confident is a liability.
Sycophancy in RLHF annotation is not a rare edge case. Research from Stanford and Anthropic has shown that it is a systematic failure mode that emerges whenever annotators are not explicitly trained and tested to distinguish "agrees with me" from "is correct." Our audit measures your pipeline's actual susceptibility rate: not a theoretical estimate, but real trap tasks run through your real annotators.
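To make the metric concrete, here is a minimal sketch of how a susceptibility rate can be computed from trap-task outcomes. The `TrapResult` fields and names are illustrative assumptions for this page, not our production audit schema:

```python
from dataclasses import dataclass

@dataclass
class TrapResult:
    """Outcome of one seeded trap task (illustrative schema)."""
    annotator_id: str
    task_type: str           # e.g. "factual-premise", "medical", "legal"
    chose_sycophantic: bool  # annotator preferred the agreeable-but-wrong response

def susceptibility_rate(results: list[TrapResult]) -> float:
    """Fraction of trap tasks where the sycophantic response was rewarded."""
    if not results:
        return 0.0
    return sum(r.chose_sycophantic for r in results) / len(results)
```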
How does sycophancy get into RLHF data?
Annotators prefer responses that: (1) agree with the user's premise even when it's wrong, (2) validate the user's expressed opinion, (3) are written in a confident, authoritative tone even if less accurate, (4) sound more sophisticated even if less correct, (5) avoid contradiction even when contradiction is the right answer. Without specific training to resist these preferences, annotators systematically introduce sycophancy biases into preference data — and those biases are then amplified by the reward model.
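Each of these preferences corrupts the training data the same way: the agreeable response gets marked as "chosen" even though it validates a false premise. A sketch of one such corrupted pair, with hypothetical field names (real RLHF schemas vary by pipeline):

```python
# One corrupted preference pair as it might appear in RLHF training data
# (hypothetical field names; real schemas vary by pipeline).
corrupted_pair = {
    "prompt": "I think the capital of Australia is Sydney -- am I right?",
    "chosen": "Yes, absolutely! Sydney is a wonderful city.",       # sycophantic, rewarded
    "rejected": "Actually, the capital of Australia is Canberra.",  # correct, penalized
}
# Thousands of pairs labeled this way teach the reward model that
# agreement outranks accuracy, and RLHF then amplifies that signal.
```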
Why can't you just check your model outputs?
By the time sycophancy is visible in model outputs, it has already been baked into your reward model through thousands of corrupted preference pairs. Checking model outputs tells you that you have a problem, but not how deep the contamination in your training data runs, which annotation tasks are most affected, or which annotators are most susceptible. Our audit answers all three questions before you scale your pipeline.
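As an illustration of how the second and third questions can be answered, the trap results from the sketch above can be grouped by task type or by annotator (again, the names are illustrative, not our production schema):

```python
from collections import defaultdict

def susceptibility_by(results: list[TrapResult], attr: str) -> dict[str, float]:
    """Susceptibility rate grouped by a TrapResult attribute,
    e.g. "task_type" or "annotator_id"."""
    groups: dict[str, list[TrapResult]] = defaultdict(list)
    for r in results:
        groups[getattr(r, attr)].append(r)
    return {key: susceptibility_rate(group) for key, group in groups.items()}

# susceptibility_by(results, "task_type")    -> which annotation tasks are most affected
# susceptibility_by(results, "annotator_id") -> which annotators are most susceptible
```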
Is this a one-time audit or ongoing?
The initial audit is a fixed-price, one-time engagement that establishes your baseline susceptibility score and delivers corrective data. After your team completes a corrective training batch (which we can also provide), a follow-up audit 4–6 weeks later measures whether annotator calibration has improved. Ongoing sycophancy monitoring is available as part of our Continuous Model Quality Retainer service.