Service — Enterprise AI

RAG System Evaluation

Human evaluation of retrieval + generation quality in RAG pipelines. Annotators assess: did the retrieval surface the right context? Did the generation faithfully use it? Detects faithfulness failures, citation errors, and hallucinations where the model ignores its own retrieved sources.

2-Stage
Separate evaluation of retrieval quality AND generation faithfulness — independently scored
6 Metrics
Context Precision, Context Recall, Faithfulness, Answer Relevance, Citation Accuracy, Refusal Appropriateness
Enterprise
Targeted at enterprise knowledge bases, legal, financial, and HR AI assistants
Free Audit
50 queries evaluated at no cost in 5 working days
Faithfulness Evaluation · Context Precision · Context Recall · Citation Accuracy · Answer Relevance · Hallucination Detection · Retrieval Quality · Enterprise RAG
What It Is

Your RAG system has two places it can fail — we measure both

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base. When it works, it grounds the model's responses in verified, up-to-date information. When it fails — and it fails in two distinct ways — the result is confident wrong answers that users trust because they appear well-sourced.

Retrieval failures: the system retrieves the wrong context, or nothing relevant at all. The model then either generates from its parametric knowledge (hallucinating, which is dangerous) or admits it does not know (which is acceptable). Generation failures: retrieval surfaced the right context, but the model ignores it, contradicts it, or adds information from outside it. This is called a faithfulness failure, and it is among the most trust-destroying failures in enterprise AI.

Automated RAG evaluation metrics (RAGAS, TruLens, LangChain Evaluators) use LLMs to evaluate LLM outputs — which creates a circular evaluation problem. An LLM evaluating whether another LLM faithfully used retrieved context will make the same kinds of errors both models make: it will often agree that a generation is faithful when it actually added external information, and it will sometimes flag faithful responses as unfaithful due to rewording.

Human expert evaluation is the ground truth. Our RAG evaluation service uses domain specialists who actually read the retrieved source documents and the generated answer, and independently judge whether: (1) the retrieval surfaced contextually relevant documents, (2) the generation faithfully used the retrieved content, and (3) the answer correctly cites and attributes its sources. This is expensive per-query — which is why automated metrics handle volume and human evaluation handles the representative sample you actually need to trust.

Why do automated RAG metrics fail?
RAGAS and similar automated evaluation frameworks use GPT-4 or Claude to judge whether a retrieved context was relevant and whether a generation was faithful. These evaluators make systematic errors: they over-rate fluent answers as faithful, they under-penalise extrapolations that go slightly beyond the retrieved context, and they cannot evaluate domain-specific faithfulness (e.g., whether a medical answer correctly interpreted a retrieved clinical guideline). Human evaluators with domain knowledge catch these systematic failures.
What types of RAG systems do you evaluate?
We evaluate any RAG architecture: naive RAG (simple vector retrieval + generation), advanced RAG (reranking, query decomposition, HyDE), modular RAG (multiple retrieval strategies), and agentic RAG (multi-step retrieval with tool use). Domain focus is enterprise: internal knowledge bases (HR policies, technical documentation, SOPs), legal research assistants, financial regulatory assistants, and customer-facing knowledge bots. We require access to your system via API — we do not need internal access to your infrastructure.
RAG Faithfulness Evaluation
✓ FAITHFUL · 68.4%
✗ UNFAITHFUL · 18.2%
~ PARTIAL · 13.4%
● IAA κ: 0.86
▼ RAG FAITHFULNESS BREAKDOWN
✓ 3,200 SOURCE-CLAIM PAIRS
RAG Evaluation

Does your RAG system actually use the context?

Human evaluators assess whether retrieval surfaces the right context and whether generation faithfully uses it. Detects faithfulness failures before they reach your enterprise users.

Get a Free Audit →
Live Annotation Interface

RAG Faithfulness Evaluation Tool

Annotators compare retrieved source passages against AI-generated claims, flagging unsupported or contradicted statements to improve RAG system grounding and citation accuracy.

Concave Label Studio — RAG Eval · System: LegalAssist RAG · 3,200 response-source pairs
SOURCE PASSAGE
Section 138 of the Negotiable Instruments Act, 1881 provides that where a cheque drawn by a person for discharge of any liability is returned by the bank unpaid due to insufficient funds, the drawer shall be deemed to have committed an offence.
AI-GENERATED CLAIM
Under Section 138 of the NI Act, a cheque bounce due to insufficient funds constitutes a criminal offence by the drawer.
FAITHFUL ✓
SOURCE PASSAGE
The complainant must give written notice to the drawer within 30 days of receiving information from the bank regarding the return of the cheque as unpaid.
AI-GENERATED CLAIM
The complainant is required to send a legal notice within 15 days of the cheque bounce to initiate Section 138 proceedings.
UNFAITHFUL ✗
SOURCE PASSAGE
The offence under Section 138 is punishable with imprisonment for a term which may extend to two years, or with fine which may extend to twice the amount of the cheque, or with both.
AI-GENERATED CLAIM
A conviction under Section 138 can result in imprisonment or fine, depending on the court's discretion.
PARTIAL ~
Evaluation Dimensions

Six RAG quality metrics, human-verified

🎯
Context Precision
Of all the documents your system retrieved for a given query, what fraction were actually relevant to answering it? Low precision means your retrieval is noisy — the model receives irrelevant information that it must either ignore or risks hallucinating from. Human annotators read each retrieved chunk and judge relevance to the query. Automated metrics score this poorly because relevance is often domain-dependent and context-sensitive.
🔍
Context Recall
Was all the information needed to correctly answer the query actually present in the retrieved context? Low recall means your retrieval missed critical information — the model then either hallucinated the missing information or gave an incomplete answer. Human annotators compare the ground-truth answer (if known) with the retrieved context to identify information gaps in retrieval.
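As a sketch, these two retrieval metrics reduce to simple ratios over human judgments. The helper names and the annotation format below are illustrative, not our production schema: annotators mark each retrieved chunk as relevant or not, and identify which facts needed for the answer actually appeared in the context.

```python
def context_precision(retrieved_relevant: list) -> float:
    """Fraction of retrieved chunks a human annotator judged relevant."""
    if not retrieved_relevant:
        return 0.0
    return sum(retrieved_relevant) / len(retrieved_relevant)

def context_recall(needed_facts: set, facts_in_context: set) -> float:
    """Fraction of the facts required for a correct answer that were
    actually present in the retrieved context."""
    if not needed_facts:
        return 1.0  # nothing was needed, so nothing was missed
    return len(needed_facts & facts_in_context) / len(needed_facts)

# 3 of 5 retrieved chunks judged relevant -> precision 0.6
precision = context_precision([True, True, False, True, False])
# 2 of 4 needed facts present in the retrieved context -> recall 0.5
recall = context_recall({"f1", "f2", "f3", "f4"}, {"f1", "f3", "f9"})
```

Low precision and low recall call for different fixes: noisy precision usually points at reranking or chunking, while poor recall points at retrieval depth or knowledge base gaps.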
📖
Faithfulness
Does every factual claim in the generated answer actually appear in or follow from the retrieved context? Low faithfulness means the model is adding information from its parametric knowledge — bypassing the RAG grounding entirely and hallucinating with false confidence. Human annotators read both the retrieved context and the generated answer, and mark every claim as supported, unsupported, or contradicted by the context.
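A minimal sketch of how claim-level labels roll up into a per-answer faithfulness score (label names are illustrative; both "unsupported" and "contradicted" count against the score):

```python
from collections import Counter

def faithfulness_score(claim_labels: list) -> float:
    """Share of an answer's factual claims marked 'supported' by the
    annotator against the retrieved context."""
    if not claim_labels:
        return 1.0  # an answer with no factual claims cannot be unfaithful
    counts = Counter(claim_labels)
    return counts["supported"] / len(claim_labels)

labels = ["supported", "supported", "unsupported", "contradicted", "supported"]
score = faithfulness_score(labels)  # 3 of 5 claims supported -> 0.6
```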
💬
Answer Relevance
Does the generated answer actually address what the user asked? Even with faithful, accurate retrieval, models sometimes answer a related but different question — particularly with ambiguous queries. Human annotators score whether the answer satisfies the information need expressed in the query, not just whether it is factually correct in the abstract.
📎
Citation Accuracy
Where your RAG system outputs citations (document names, page numbers, section references), are those citations accurate? Does the cited document actually support the claim it is attached to? Citation hallucination — where a model correctly retrieves a document but then fabricates or misattributes the citation — is a common and particularly trust-damaging failure in enterprise knowledge assistants.
🚫
Refusal Appropriateness
When your RAG system cannot answer because the knowledge base does not contain relevant information, does it correctly say so? Or does it hallucinate an answer? And conversely: does it ever refuse to answer when relevant information was actually retrieved? Human annotators judge whether refusals and confessions of ignorance are appropriate given the available retrieved context.
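Refusal appropriateness is effectively a two-by-two judgment: did the context contain an answer, and did the model refuse? A sketch of that classification (the category names are our illustrative shorthand):

```python
def refusal_outcome(context_sufficient: bool, model_refused: bool) -> str:
    """Classify a refusal decision against the annotator's judgment of
    whether the retrieved context could actually answer the query."""
    if context_sufficient and model_refused:
        return "over-refusal"        # relevant context retrieved, but refused anyway
    if not context_sufficient and not model_refused:
        return "hallucination-risk"  # answered without grounding to answer from
    return "appropriate"             # answered with grounding, or refused without it
```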
The Process

From RAG query log to actionable quality metrics

01
Query Sampling & System Access
We work from your RAG system's query logs or a representative evaluation query set. We stratify the query sample across query types (factual, procedural, comparative, ambiguous), query complexity, and domain (if your knowledge base spans multiple domains). We access your RAG system via API to run each query and capture: the query, all retrieved chunks with their source documents, the generated answer, and any citations produced. No internal infrastructure access is required.
Stratified query sampling · API-based retrieval capture · Context + answer logging
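The stratification step can be sketched as follows. This is a simplified illustration, assuming a query log with a single `type` field; in practice the strata also cover complexity and domain:

```python
import random
from collections import defaultdict

def stratified_sample(queries, key, n_per_stratum, seed=0):
    """Sample up to n_per_stratum queries from each stratum (e.g. query type),
    with a fixed seed so the evaluation set is reproducible."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        strata[key(q)].append(q)
    sample = []
    for group in strata.values():
        rng.shuffle(group)
        sample.extend(group[:n_per_stratum])
    return sample

# Toy log: 100 queries, skewed heavily toward factual lookups
log = [{"id": i, "type": t} for i, t in enumerate(
    ["factual"] * 40 + ["procedural"] * 30 + ["comparative"] * 20 + ["ambiguous"] * 10)]
audit_set = stratified_sample(log, key=lambda q: q["type"], n_per_stratum=12)
# 12 + 12 + 12 + 10 = 46 queries: ambiguous contributes all 10 it has
```

Stratifying rather than sampling uniformly keeps rare but failure-prone query types (ambiguous, comparative) represented in the human review sample.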
02
Automated Pre-Scoring
Our automated pipeline computes initial scores on all 6 metrics using semantic similarity, embedding-based relevance scoring, and LLM-as-judge methods. This provides a fast first-pass estimate that surfaces the queries most likely to contain failures. Automated scoring also handles the retrieval-level metrics (context precision and recall) where embedding similarity is a reliable signal — freeing human review time for the generation-level metrics where automation fails most often.
Semantic similarity scoring · LLM-as-judge pre-scoring · Failure-likely flagging
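One of the cheapest pre-scoring signals is query-to-chunk embedding similarity: if no retrieved chunk is close to the query, retrieval probably missed and the query should be prioritised for human review. A minimal sketch, assuming embeddings are already computed and passed in as plain vectors (the threshold value is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_for_human_review(query_emb, chunk_embs, threshold=0.45):
    """Flag the query when no retrieved chunk clears the similarity
    threshold: a cheap signal that retrieval likely failed."""
    best = max((cosine(query_emb, c) for c in chunk_embs), default=0.0)
    return best < threshold
```

This deliberately errs toward flagging: false positives cost a little human review time, while false negatives let retrieval failures slip through unreviewed.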
03
Domain Expert Human Evaluation
Domain experts read the retrieved chunks and generated answer for each query in the human review sample. They independently score faithfulness (claim by claim), answer relevance, and citation accuracy. For queries flagged as potentially failing by automation, two independent experts evaluate — with disagreements adjudicated by a senior reviewer. This is the most expensive but most reliable evaluation layer — and it is what automated metrics consistently fail to approximate.
Claim-by-claim faithfulness · Domain-matched experts · Double-evaluation on flagged queries
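When two experts score the same flagged queries, agreement is reported as Cohen's kappa (the IAA κ figure shown on our dashboards). A self-contained sketch of that computation over matched label lists:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two annotators over the same
    items, corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Kappa above roughly 0.8 is generally read as strong agreement; persistent disagreement on a query type usually means the annotation guidelines need sharpening, not that an annotator is wrong.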
04
Report & Improvement Recommendations
Delivery includes: a per-query evaluation table with all 6 metric scores, overall system quality scores by metric, failure pattern analysis (where and why is faithfulness lowest? Which query types have worst retrieval recall?), and specific improvement recommendations: chunking strategy changes, retrieval parameter tuning, reranking additions, prompt engineering modifications. Optionally: corrective RLHF pairs where the model generated unfaithful answers, showing faithful alternatives as the preferred response.
Per-query scorecard · Failure pattern analysis · Improvement recommendations · Corrective RLHF pairs option
What You Get

A complete picture of your RAG system's actual performance

📊
Per-Query Scorecard
Every evaluated query with scores on all 6 metrics, the retrieved contexts, the generated answer, human reviewer notes on specific failures, and the automated vs. human score comparison. Structured JSON and formatted PDF.
📈
System Quality Report
Aggregate scores by metric, failure distribution by query type, domain-specific performance breakdown, specific failure examples with explanation, and comparison between automated and human evaluation (showing where automated metrics diverge from ground truth).
🔧
Improvement Roadmap
Prioritised recommendations for system improvement: retrieval parameter changes, chunking strategy adjustments, reranking additions, prompt engineering modifications, and knowledge base gaps to fill. Estimated impact on key metrics for each recommendation based on our experience with similar RAG configurations.
Pricing

Per-query evaluation pricing

Priced per evaluated query including retrieval quality and generation faithfulness assessment. Human evaluation sample size scales with total query volume. Free 50-query audit with no commitment.

Get 50 Queries Audited Free →
General enterprise RAG · ₹400–800 / query
Legal / medical / financial RAG · ₹800–2,000 / query
Multi-hop / agentic RAG · ₹1,000–2,500 / query
Corrective RLHF pairs (add-on) · ₹600–1,500 / pair
Continuous monthly evaluation · ₹2L – ₹8L / month
Free audit · 50 queries / ₹0

Get 50 RAG queries evaluated free

Share 50 queries from your RAG system — or we can generate representative ones. We return faithfulness scores, retrieval quality assessment, and a 1-page findings report in 5 working days.