Why 38% of RLHF preference data trains your model to lie — and how to detect it before training
Sycophancy enters RLHF pipelines when annotators reward confident agreement over factual accuracy. Here are the mechanism, the measurement protocol, and the annotation-level fix.