RAG systems fail in two distinct ways. Retrieval failures: the system retrieves the wrong context, or retrieves nothing relevant at all. The model then either generates from its parametric knowledge (hallucinating) or honestly confesses ignorance; the latter is acceptable, the former is dangerous. Generation failures: retrieval surfaced the right context, but the model ignores it, contradicts it, or adds information from outside it. This is a faithfulness failure, and it is among the most trust-destroying failure modes in enterprise AI.
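The taxonomy above can be sketched as a small decision function. All names here are illustrative, and the boolean judgments it consumes must come from an evaluator (automated or human); the function only encodes how the failure modes relate:

```python
from enum import Enum

class RAGOutcome(Enum):
    OK = "ok"
    HONEST_ABSTENTION = "honest_abstention"        # retrieval failed, model admitted ignorance
    HALLUCINATION = "hallucination"                # retrieval failed, model answered anyway
    FAITHFULNESS_FAILURE = "faithfulness_failure"  # good context, model ignored or contradicted it

def classify(retrieval_relevant: bool, answer_grounded: bool, abstained: bool) -> RAGOutcome:
    """Map two upstream judgments (plus an abstention flag) onto the
    failure taxonomy. Hypothetical helper, not a real library API."""
    if not retrieval_relevant:
        # Retrieval failure branch: confessing ignorance is fine,
        # answering from parametric knowledge is the dangerous case.
        return RAGOutcome.HONEST_ABSTENTION if abstained else RAGOutcome.HALLUCINATION
    # Retrieval succeeded: the only remaining risk is unfaithful generation.
    return RAGOutcome.OK if answer_grounded else RAGOutcome.FAITHFULNESS_FAILURE
```

Keeping the two branches separate matters operationally: a hallucination points at the retriever, a faithfulness failure points at the generator.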
Automated RAG evaluation frameworks (RAGAS, TruLens, LangChain Evaluators) use LLMs to grade LLM outputs, which creates a circular evaluation problem. An LLM judging whether another LLM faithfully used retrieved context is prone to the same errors as the model under test: it will often accept a generation as faithful when it actually added external information, and it will sometimes flag faithful responses as unfaithful merely because they paraphrase the source.
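One way to see what a non-circular check even looks like is a deterministic lexical proxy: flag answer sentences whose content words barely overlap the retrieved context. This is a crude sketch, not how any of the frameworks above work, and it will mis-score exactly the paraphrase cases described; its value is that it involves no model, so it cannot share the judge model's blind spots. The threshold and tokenization are illustrative choices, not tuned values:

```python
import re

def low_support_sentences(answer: str, context: str, threshold: float = 0.3) -> list[str]:
    """Return answer sentences whose content-word overlap with the
    retrieved context falls below `threshold`. Deterministic, so it
    avoids circularity -- but it is only a lexical heuristic."""
    ctx_words = set(re.findall(r"[a-z0-9']+", context.lower()))
    flagged = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Keep only words longer than 3 chars as rough "content words".
        words = [w for w in re.findall(r"[a-z0-9']+", sent.lower()) if len(w) > 3]
        if not words:
            continue
        support = sum(w in ctx_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sent)
    return flagged
```

In practice a check like this is only useful as a cheap tripwire alongside model-graded or human evaluation, not as a replacement for either.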
Human expert evaluation is the ground truth. Our RAG evaluation service uses domain specialists who actually read the retrieved source documents and the generated answer, and independently judge whether: (1) the retrieval surfaced contextually relevant documents, (2) the generation faithfully used the retrieved content, and (3) the answer correctly cites and attributes its sources. This is expensive per query, which is why automated metrics handle volume while human evaluation handles the representative sample you actually need to trust.
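The three criteria suggest a simple record shape for collecting expert verdicts. A minimal sketch, assuming one judgment per query (field and function names are hypothetical, not part of any framework):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HumanRAGJudgment:
    """One expert's verdict on a single query, mirroring the three criteria."""
    query_id: str
    retrieval_relevant: bool   # (1) did retrieval surface relevant documents?
    generation_faithful: bool  # (2) did the answer stick to the retrieved content?
    citations_correct: bool    # (3) are sources cited and attributed correctly?

def faithfulness_rate(judgments: list[HumanRAGJudgment]) -> float:
    """Share of faithful generations among queries where retrieval succeeded.
    Retrieval failures are excluded so the two failure modes stay separated
    in reporting rather than blurring into one accuracy number."""
    relevant = [j for j in judgments if j.retrieval_relevant]
    if not relevant:
        return 0.0
    return sum(j.generation_faithful for j in relevant) / len(relevant)
```

Conditioning faithfulness on successful retrieval is the key design choice: it keeps the human sample's numbers attributable to the retriever or the generator, the same separation the failure taxonomy demands.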