Service — Enterprise AI

RAG System Evaluation

Human evaluation of retrieval + generation quality in RAG pipelines. Annotators assess: did the retrieval surface the right context? Did the generation faithfully use it? Detects faithfulness failures, citation errors, and hallucinations where the model ignores its own retrieved sources.

2-Stage
Separate evaluation of retrieval quality AND generation faithfulness — independently scored
6 Metrics
Context Precision, Context Recall, Faithfulness, Answer Relevance, Citation Accuracy, Refusal Appropriateness
Enterprise
Targeted at enterprise knowledge bases, legal, financial, and HR AI assistants
Free Audit
50 queries evaluated at no cost in 5 working days
Faithfulness Evaluation · Context Precision · Context Recall · Citation Accuracy · Answer Relevance · Hallucination Detection · Retrieval Quality · Enterprise RAG
What It Is

Your RAG system has two places it can fail — we measure both

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base. When it works, it grounds the model's responses in verified, up-to-date information. When it fails — and it fails in two distinct ways — the result is confident wrong answers that users trust because they appear well-sourced.

Retrieval failures: the system retrieves the wrong context, or nothing relevant at all. The model then either generates from its parametric knowledge (hallucinating, which is dangerous) or admits it does not know (which is acceptable). Generation failures: retrieval surfaced the right context, but the model ignores it, contradicts it, or adds information from outside it. This is called a faithfulness failure, and it is among the most trust-destroying failures in enterprise AI.

Automated RAG evaluation metrics (RAGAS, TruLens, LangChain Evaluators) use LLMs to evaluate LLM outputs — which creates a circular evaluation problem. An LLM evaluating whether another LLM faithfully used retrieved context will make the same kinds of errors both models make: it will often agree that a generation is faithful when it actually added external information, and it will sometimes flag faithful responses as unfaithful due to rewording.

Human expert evaluation is the ground truth. Our RAG evaluation service uses domain specialists who actually read the retrieved source documents and the generated answer, and independently judge whether: (1) the retrieval surfaced contextually relevant documents, (2) the generation faithfully used the retrieved content, and (3) the answer correctly cites and attributes its sources. This is expensive per-query — which is why automated metrics handle volume and human evaluation handles the representative sample you actually need to trust.

Why do automated RAG metrics fail?
RAGAS and similar automated evaluation frameworks use GPT-4 or Claude to judge whether a retrieved context was relevant and whether a generation was faithful. These evaluators make systematic errors: they over-rate fluent answers as faithful, they under-penalise extrapolations that go slightly beyond the retrieved context, and they cannot evaluate domain-specific faithfulness (e.g., whether a medical answer correctly interpreted a retrieved clinical guideline). Human evaluators with domain knowledge catch these systematic failures.
What types of RAG systems do you evaluate?
We evaluate any RAG architecture: naive RAG (simple vector retrieval + generation), advanced RAG (reranking, query decomposition, HyDE), modular RAG (multiple retrieval strategies), and agentic RAG (multi-step retrieval with tool use). Domain focus is enterprise: internal knowledge bases (HR policies, technical documentation, SOPs), legal research assistants, financial regulatory assistants, and customer-facing knowledge bots. We require access to your system via API — we do not need internal access to your infrastructure.
RAG Faithfulness Evaluation
✓ FAITHFUL · 68.4%
✗ UNFAITHFUL · 18.2%
~ PARTIAL · 13.4%
● IAA κ: 0.86
▼ RAG FAITHFULNESS BREAKDOWN
✓ 3,200 SOURCE-CLAIM PAIRS
RAG Evaluation

Does your RAG system actually use the context?

Human evaluators assess whether retrieval surfaces the right context and whether generation faithfully uses it. Detects faithfulness failures before they reach your enterprise users.

Get a Free Audit →
Live Annotation Interface

RAG Faithfulness Evaluation Tool

Annotators compare retrieved source passages against AI-generated claims, flagging unsupported or contradicted statements to improve RAG system grounding and citation accuracy.

Concave Label Studio — RAG Eval · System: LegalAssist RAG · 3,200 response-source pairs
SOURCE PASSAGE
Section 138 of the Negotiable Instruments Act, 1881 provides that where a cheque drawn by a person for discharge of any liability is returned by the bank unpaid due to insufficient funds, the drawer shall be deemed to have committed an offence.
AI-GENERATED CLAIM
Under Section 138 of the NI Act, a cheque bounce due to insufficient funds constitutes a criminal offence by the drawer.
FAITHFUL ✓
SOURCE PASSAGE
The complainant must give written notice to the drawer within 30 days of receiving information from the bank regarding the return of the cheque as unpaid.
AI-GENERATED CLAIM
The complainant is required to send a legal notice within 15 days of the cheque bounce to initiate Section 138 proceedings.
UNFAITHFUL ✗
SOURCE PASSAGE
The offence under Section 138 is punishable with imprisonment for a term which may extend to two years, or with fine which may extend to twice the amount of the cheque, or with both.
AI-GENERATED CLAIM
A conviction under Section 138 can result in imprisonment or fine, depending on the court's discretion.
PARTIAL ~
Evaluation Dimensions

Six RAG quality metrics, human-verified

🎯
Context Precision
Of all the documents your system retrieved for a given query, what fraction were actually relevant to answering it? Low precision means your retrieval is noisy — the model receives irrelevant information that it must either ignore or risks hallucinating from. Human annotators read each retrieved chunk and judge relevance to the query. Automated metrics score this poorly because relevance is often domain-dependent and context-sensitive.
🔍
Context Recall
Was all the information needed to correctly answer the query actually present in the retrieved context? Low recall means your retrieval missed critical information — the model then either hallucinated the missing information or gave an incomplete answer. Human annotators compare the ground-truth answer (if known) with the retrieved context to identify information gaps in retrieval.
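As a sketch, these two retrieval metrics reduce to simple ratios over human judgments. The helper names and the annotation format below are illustrative, not our production schema: annotators mark each retrieved chunk as relevant or not, and identify which facts needed for the answer actually appeared in the context.

```python
def context_precision(retrieved_relevant: list) -> float:
    """Fraction of retrieved chunks a human annotator judged relevant."""
    if not retrieved_relevant:
        return 0.0
    return sum(retrieved_relevant) / len(retrieved_relevant)

def context_recall(needed_facts: set, facts_in_context: set) -> float:
    """Fraction of the facts required for a correct answer that were
    actually present in the retrieved context."""
    if not needed_facts:
        return 1.0  # nothing was needed, so nothing was missed
    return len(needed_facts & facts_in_context) / len(needed_facts)

# 3 of 5 retrieved chunks judged relevant -> precision 0.6
precision = context_precision([True, True, False, True, False])
# 2 of 4 needed facts present in the retrieved context -> recall 0.5
recall = context_recall({"f1", "f2", "f3", "f4"}, {"f1", "f3", "f9"})
```

Low precision and low recall call for different fixes: noisy precision usually points at reranking or chunking, while poor recall points at retrieval depth or knowledge base gaps.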
📖
Faithfulness
Does every factual claim in the generated answer actually appear in or follow from the retrieved context? Low faithfulness means the model is adding information from its parametric knowledge — bypassing the RAG grounding entirely and hallucinating with false confidence. Human annotators read both the retrieved context and the generated answer, and mark every claim as supported, unsupported, or contradicted by the context.
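A minimal sketch of how claim-level labels roll up into a per-answer faithfulness score (label names are illustrative; both "unsupported" and "contradicted" count against the score):

```python
from collections import Counter

def faithfulness_score(claim_labels: list) -> float:
    """Share of an answer's factual claims marked 'supported' by the
    annotator against the retrieved context."""
    if not claim_labels:
        return 1.0  # an answer with no factual claims cannot be unfaithful
    counts = Counter(claim_labels)
    return counts["supported"] / len(claim_labels)

labels = ["supported", "supported", "unsupported", "contradicted", "supported"]
score = faithfulness_score(labels)  # 3 of 5 claims supported -> 0.6
```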
💬
Answer Relevance
Does the generated answer actually address what the user asked? Even with faithful, accurate retrieval, models sometimes answer a related but different question — particularly with ambiguous queries. Human annotators score whether the answer satisfies the information need expressed in the query, not just whether it is factually correct in the abstract.
📎
Citation Accuracy
Where your RAG system outputs citations (document names, page numbers, section references), are those citations accurate? Does the cited document actually support the claim it is attached to? Citation hallucination — where a model correctly retrieves a document but then fabricates or misattributes the citation — is a common and particularly trust-damaging failure in enterprise knowledge assistants.
🚫
Refusal Appropriateness
When your RAG system cannot answer because the knowledge base does not contain relevant information, does it correctly say so? Or does it hallucinate an answer? And conversely: does it ever refuse to answer when relevant information was actually retrieved? Human annotators judge whether refusals and confessions of ignorance are appropriate given the available retrieved context.
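Refusal appropriateness is effectively a two-by-two judgment: did the context contain an answer, and did the model refuse? A sketch of that classification (the category names are our illustrative shorthand):

```python
def refusal_outcome(context_sufficient: bool, model_refused: bool) -> str:
    """Classify a refusal decision against the annotator's judgment of
    whether the retrieved context could actually answer the query."""
    if context_sufficient and model_refused:
        return "over-refusal"        # relevant context retrieved, but refused anyway
    if not context_sufficient and not model_refused:
        return "hallucination-risk"  # answered without grounding to answer from
    return "appropriate"             # answered with grounding, or refused without it
```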
The Process

From RAG query log to actionable quality metrics

01
Query Sampling & System Access
We work from your RAG system's query logs or a representative evaluation query set. We stratify the query sample across query types (factual, procedural, comparative, ambiguous), query complexity, and domain (if your knowledge base spans multiple domains). We access your RAG system via API to run each query and capture: the query, all retrieved chunks with their source documents, the generated answer, and any citations produced. No internal infrastructure access is required.
Stratified query sampling · API-based retrieval capture · Context + answer logging
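The stratification step can be sketched as follows. This is a simplified illustration, assuming a query log with a single `type` field; in practice the strata also cover complexity and domain:

```python
import random
from collections import defaultdict

def stratified_sample(queries, key, n_per_stratum, seed=0):
    """Sample up to n_per_stratum queries from each stratum (e.g. query type),
    with a fixed seed so the evaluation set is reproducible."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        strata[key(q)].append(q)
    sample = []
    for group in strata.values():
        rng.shuffle(group)
        sample.extend(group[:n_per_stratum])
    return sample

# Toy log: 100 queries, skewed heavily toward factual lookups
log = [{"id": i, "type": t} for i, t in enumerate(
    ["factual"] * 40 + ["procedural"] * 30 + ["comparative"] * 20 + ["ambiguous"] * 10)]
audit_set = stratified_sample(log, key=lambda q: q["type"], n_per_stratum=12)
# 12 + 12 + 12 + 10 = 46 queries: ambiguous contributes all 10 it has
```

Stratifying rather than sampling uniformly keeps rare but failure-prone query types (ambiguous, comparative) represented in the human review sample.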
02
Automated Pre-Scoring
Our automated pipeline computes initial scores on all 6 metrics using semantic similarity, embedding-based relevance scoring, and LLM-as-judge methods. This provides a fast first-pass estimate that surfaces the queries most likely to contain failures. Automated scoring also handles the retrieval-level metrics (context precision and recall) where embedding similarity is a reliable signal — freeing human review time for the generation-level metrics where automation fails most often.
Semantic similarity scoring · LLM-as-judge pre-scoring · Failure-likely flagging
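One of the cheapest pre-scoring signals is query-to-chunk embedding similarity: if no retrieved chunk is close to the query, retrieval probably missed and the query should be prioritised for human review. A minimal sketch, assuming embeddings are already computed and passed in as plain vectors (the threshold value is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_for_human_review(query_emb, chunk_embs, threshold=0.45):
    """Flag the query when no retrieved chunk clears the similarity
    threshold: a cheap signal that retrieval likely failed."""
    best = max((cosine(query_emb, c) for c in chunk_embs), default=0.0)
    return best < threshold
```

This deliberately errs toward flagging: false positives cost a little human review time, while false negatives let retrieval failures slip through unreviewed.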
03
Domain Expert Human Evaluation
Domain experts read the retrieved chunks and generated answer for each query in the human review sample. They independently score faithfulness (claim by claim), answer relevance, and citation accuracy. For queries flagged as potentially failing by automation, two independent experts evaluate — with disagreements adjudicated by a senior reviewer. This is the most expensive but most reliable evaluation layer — and it is what automated metrics consistently fail to approximate.
Claim-by-claim faithfulness · Domain-matched experts · Double-evaluation on flagged queries
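When two experts score the same flagged queries, agreement is reported as Cohen's kappa (the IAA κ figure shown on our dashboards). A self-contained sketch of that computation over matched label lists:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two annotators over the same
    items, corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Kappa above roughly 0.8 is generally read as strong agreement; persistent disagreement on a query type usually means the annotation guidelines need sharpening, not that an annotator is wrong.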
04
Report & Improvement Recommendations
Delivery includes: a per-query evaluation table with all 6 metric scores, overall system quality scores by metric, failure pattern analysis (where and why is faithfulness lowest? Which query types have worst retrieval recall?), and specific improvement recommendations: chunking strategy changes, retrieval parameter tuning, reranking additions, prompt engineering modifications. Optionally: corrective RLHF pairs where the model generated unfaithful answers, showing faithful alternatives as the preferred response.
Per-query scorecard · Failure pattern analysis · Improvement recommendations · Corrective RLHF pairs option
What You Get

A complete picture of your RAG system's actual performance

📊
Per-Query Scorecard
Every evaluated query with scores on all 6 metrics, the retrieved contexts, the generated answer, human reviewer notes on specific failures, and the automated vs. human score comparison. Structured JSON and formatted PDF.
📈
System Quality Report
Aggregate scores by metric, failure distribution by query type, domain-specific performance breakdown, specific failure examples with explanation, and comparison between automated and human evaluation (showing where automated metrics diverge from ground truth).
🔧
Improvement Roadmap
Prioritised recommendations for system improvement: retrieval parameter changes, chunking strategy adjustments, reranking additions, prompt engineering modifications, and knowledge base gaps to fill. Estimated impact on key metrics for each recommendation based on our experience with similar RAG configurations.
Pricing

Per-query evaluation pricing

Priced per evaluated query including retrieval quality and generation faithfulness assessment. Human evaluation sample size scales with total query volume. Free 50-query audit with no commitment.

Get 50 Queries Audited Free →
General enterprise RAG · ₹400–800 / query
Legal / medical / financial RAG · ₹800–2,000 / query
Multi-hop / agentic RAG · ₹1,000–2,500 / query
Corrective RLHF pairs (add-on) · ₹600–1,500 / pair
Continuous monthly evaluation · ₹2L – ₹8L / month
Free audit · 50 queries / ₹0

Get 50 RAG queries evaluated free

Share 50 queries from your RAG system — or we can generate representative ones. We return faithfulness scores, retrieval quality assessment, and a 1-page findings report in 5 working days.