
Deep dives into data quality for AI

Technical insights on annotation quality, RLHF alignment, and the data engineering decisions that determine whether AI models work in production. Written by ML engineers, for ML engineers.

12 min read
Why 38% of RLHF preference data trains your model to lie — and how to detect it before training
Sycophancy enters RLHF pipelines when annotators reward confident agreement over factual accuracy. Here is the mechanism, the measurement protocol, and the annotation-level fix; a toy version of one such detection check is sketched below.
Read the full analysis →
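The full measurement protocol is in the linked analysis; what follows is a deliberately crude, hypothetical sketch of the general idea only: seed evaluation prompts with a known-false premise, then flag preference pairs where the chosen response plays along while the rejected response corrects it. Every record, field name, and marker list here is invented for illustration, and a production audit would use human re-review or an LLM judge rather than keyword matching.

```python
# Hypothetical pre-training audit sketch. Seed the eval set with prompts
# asserting a known-false premise, then check which behavior the
# preference data actually rewards. All data below is invented.

AGREEMENT_MARKERS = ("you're right", "you are right", "that's correct", "exactly")
CORRECTION_MARKERS = ("actually", "that's not quite right", "incorrect")

def flags_sycophancy(pair):
    """True if the chosen response agrees with the false premise while
    the rejected response pushes back: the pattern that teaches a model
    to prefer agreement over accuracy."""
    chosen = pair["chosen"].lower()
    rejected = pair["rejected"].lower()
    chosen_agrees = any(m in chosen for m in AGREEMENT_MARKERS)
    rejected_corrects = any(m in rejected for m in CORRECTION_MARKERS)
    return chosen_agrees and rejected_corrects

pairs = [
    {"prompt": "Since the Sun orbits the Earth, how long does one orbit take?",
     "chosen": "You're right, it takes about a year for the Sun to orbit Earth.",
     "rejected": "Actually, the Earth orbits the Sun; one orbit takes a year."},
]

rate = sum(flags_sycophancy(p) for p in pairs) / len(pairs)
print(f"sycophantic preference pairs: {rate:.0%}")
```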
18 min read
How to Train an AI Model: The Complete 2026 Guide to Workflow, Data, and Getting It Right
Pre-training, fine-tuning, or RLHF — choosing the wrong approach costs six weeks and hundreds of thousands of rupees. This guide covers the full training workflow, modality-specific data requirements, real cost breakdowns, and the six annotation mistakes that silently lower your model's ceiling.
Read the full guide →
10 min read
Cohen's kappa explained for ML engineers — the annotation quality metric your pipeline probably is not measuring
Inter-annotator agreement is the single most important quality metric in data annotation. Here is what it measures, how to interpret it, and why "98% accuracy" without kappa is meaningless; a worked toy example follows below.
Read the full guide →
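To make the "98% without kappa" claim concrete, here is a minimal sketch in plain Python with toy data (in a real pipeline you would likely reach for sklearn.metrics.cohen_kappa_score). Two annotators agree on 98% of items in a heavily imbalanced task, yet kappa comes out at roughly zero, because nearly all of that agreement is what chance alone predicts.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected
    for the agreement two annotators would reach by chance alone."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's label marginals.
    p_e = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy data: 1000 items, ~99% genuinely "ok". Both annotators agree on
# 980 easy items, but their 10 "bad" calls each land on different items.
ann_a = ["ok"] * 980 + ["bad"] * 10 + ["ok"] * 10
ann_b = ["ok"] * 980 + ["ok"] * 10 + ["bad"] * 10

# Raw agreement looks excellent; kappa exposes it as chance-level.
print(sum(a == b for a, b in zip(ann_a, ann_b)) / 1000)  # 0.98
print(round(cohens_kappa(ann_a, ann_b), 3))              # ~ -0.01
```

The point of the toy numbers: with a 99/1 class split, two annotators who rubber-stamp the majority class agree 98% of the time without ever agreeing on what actually matters, which is exactly the failure mode a raw accuracy figure hides.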
Coming soon
Agriculture AI
Why crop disease models trained on Western data fail on Indian farms
PlantVillage contains zero Indian crop varieties. The accuracy gap on Pusa Ruby, IR-64, and MCU-5 is measurable — and fixable at the annotation level.
Finance & BFSI
What your credit AI learned from annotators who cannot read a balance sheet
Financial AI annotation requires domain-qualified professionals. Generic annotators produce models that are confidently wrong on the queries that matter most.
Legal AI
How legal AI hallucinates Supreme Court judgments — and why the fix is in the data
SFT data with unverified citations teaches models that citation format is rewarded regardless of whether the case exists. The fix is claim-level verification.

Want to see these principles applied to your data?

Free model audit — 50 outputs evaluated, sycophancy and hallucination findings in 5 working days. No cost, no commitment.

Request Free Audit →