When an AI model says "The Indian Companies Act 2013 requires all directors to hold a minimum of one share," that sentence contains a specific, verifiable factual claim. It is either correct or incorrect. Document-level quality review cannot reliably catch this — it requires someone with legal expertise to check that specific claim against the correct statute text.
Most AI evaluation approaches rate output quality holistically — coherence, fluency, helpfulness — without systematically extracting and verifying individual factual claims. This means a response rated 4/5 for "overall quality" can contain multiple significant factual errors that the quality rater simply missed or was not equipped to catch.
Our hallucination detection pipeline inverts this. We first use an LLM-based claim extractor to decompose every AI output into its individual factual claims — typically 3–15 claims per response. Each claim is then routed to a domain expert for verification against authoritative sources. Claims are classified into four severity tiers: Verified (correct), Minor Inaccuracy (partially correct), Significant Hallucination (meaningfully wrong), and Critical Fabrication (dangerous or completely invented). You receive a per-claim report, an overall hallucination rate, and a severity breakdown — actionable data, not an impressionistic quality score.
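The report described above can be sketched as a small aggregation step. This is an illustrative data model, not our actual API; the class, function, and tier names are assumptions chosen to mirror the description:

```python
from dataclasses import dataclass
from collections import Counter

# Severity tiers from the pipeline description (identifier names are assumptions).
SEVERITIES = [
    "verified",
    "minor_inaccuracy",
    "significant_hallucination",
    "critical_fabrication",
]

@dataclass
class ClaimVerdict:
    claim: str     # one atomic factual claim extracted from the AI output
    severity: str  # one of SEVERITIES, assigned by a domain expert

def summarize(verdicts: list[ClaimVerdict]) -> dict:
    """Aggregate per-claim verdicts into a hallucination rate and severity breakdown."""
    counts = Counter(v.severity for v in verdicts)
    total = len(verdicts)
    # Any claim that is not fully verified counts toward the hallucination rate.
    hallucinated = total - counts.get("verified", 0)
    return {
        "hallucination_rate": hallucinated / total if total else 0.0,
        "severity_breakdown": {s: counts.get(s, 0) for s in SEVERITIES},
    }
```

A response with, say, two verified claims, one minor inaccuracy, and one significant hallucination would yield a hallucination rate of 0.5 with the breakdown showing where the errors concentrate.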
Why does hallucination happen?
Language models generate text by predicting statistically likely continuations — they do not retrieve facts from a knowledge database and verify them before producing output. When a plausible-sounding but false claim is the most likely continuation of a prompt, the model produces it confidently. Fine-tuning and RLHF reduce hallucination rates but do not eliminate them — especially in long-tail knowledge domains where training data is sparse.
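A toy single decoding step makes the mechanism concrete. The scores below are invented for illustration; the point is that nothing in this step consults a fact source — the highest-probability token is emitted whether or not it is true:

```python
import math

# Hypothetical model scores (logits) for candidate next tokens answering
# "The company was founded in ____". The values are invented for illustration.
vocab_logits = {"1913": 2.1, "1933": 1.4, "1955": 0.3}

def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(vocab_logits)
# Greedy decoding: emit the most probable token. Truth never enters the calculation.
next_token = max(probs, key=probs.get)
```

If the training data made "1913" the statistically likely continuation, the model states it with the same fluency whether the company was actually founded then or not.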
What is the difference between a hallucination and a factual error?
We use "hallucination" specifically for AI-generated claims that are false and not traceable to any supporting source — the model invented the fact. Factual errors include claims that are partially true (outdated information, misattributed statistics, wrong names for real entities). Our severity classification distinguishes between these: fabrications (invented) score higher severity than inaccuracies (partially wrong but traceable to real information).
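The traceability distinction above can be written as a simple decision rule. This is a sketch of the logic, not our production classifier, and the function and tier names are assumptions; the real process also weighs domain impact when grading severity:

```python
def severity(correct: bool, traceable: bool, dangerous: bool = False) -> str:
    """Illustrative decision rule mapping a verified claim to a severity tier.

    correct   -- the claim matches authoritative sources
    traceable -- the claim derives from real information (e.g. outdated or
                 misattributed), rather than being invented outright
    dangerous -- an untraceable fabrication with potential for real harm
    """
    if correct:
        return "verified"
    if traceable:
        # Partially true: wrong but anchored to real information.
        return "minor_inaccuracy"
    # Not traceable to any supporting source: the model invented it.
    return "critical_fabrication" if dangerous else "significant_hallucination"
```

An outdated statistic lands in "minor_inaccuracy"; a fabricated case citation, being untraceable, lands in the hallucination tiers.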
Who needs hallucination detection?
Any team deploying AI that makes factual claims: medical AI (wrong drug information, incorrect diagnostic criteria), legal AI (fabricated case citations, wrong statute details), financial AI (incorrect regulatory requirements, wrong figures), educational AI (wrong dates, incorrect scientific facts), and enterprise RAG systems (faithfulness failures where the generation does not match the retrieved context).