Service - Enterprise AI

RAG System Evaluation

Human evaluation of retrieval + generation quality in RAG pipelines. Annotators assess: did the retrieval surface the right context? Did the generation faithfully use it? Detects faithfulness failures, citation errors, and hallucinations where the model ignores its own retrieved sources.

2-Stage
Retrieval quality and generation faithfulness evaluated and scored independently
5 Metrics
Context Precision, Recall, Faithfulness, Answer Relevance, Citation Accuracy
Enterprise
Targeted at enterprise knowledge bases, legal, financial, and HR AI assistants
Free Audit
50 queries evaluated at no cost in 5 working days
RAG Faithfulness Evaluation
✓ FAITHFUL · 68.4%
✗ UNFAITHFUL · 18.2%
~ PARTIAL · 13.4%
● IAA κ: 0.86
RAG FAITHFULNESS BREAKDOWN (scale 0% to 100%): 68% FAITHFUL
✓ 3,200 SOURCE-CLAIM PAIRS
RAG Evaluation - What It Is

Does your RAG system actually use the context?

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base. When it works, it grounds the model's responses in verified, up-to-date information. When it fails, it fails in one of two distinct ways: either it retrieves the wrong context and answers with confidence, producing responses that users trust because they appear well-sourced, or it retrieves nothing relevant at all. We assess whether retrieval surfaces the right context and whether generation faithfully uses it, detecting faithfulness failures before they reach your enterprise users.

Get a Free Audit →
Live Annotation Interface

RAG Faithfulness Evaluation Tool

Annotators compare retrieved source passages against AI-generated claims, flagging unsupported or contradicted statements to improve RAG system grounding and citation accuracy.

ConcaveLabel Studio RAG Eval · System: LegalAssist RAG · 3,200 response-source pairs
SOURCE PASSAGE
Section 138 of the Negotiable Instruments Act, 1881 provides that where a cheque drawn by a person for discharge of any liability is returned by the bank unpaid due to insufficient funds, the drawer shall be deemed to have committed an offence.
AI-GENERATED CLAIM
Under Section 138 of the NI Act, a cheque bounce due to insufficient funds constitutes a criminal offence by the drawer.
FAITHFUL ✓
SOURCE PASSAGE
The complainant must give written notice to the drawer within 30 days of receiving information from the bank regarding the return of the cheque as unpaid.
AI-GENERATED CLAIM
The complainant is required to send a legal notice within 15 days of the cheque bounce to initiate Section 138 proceedings.
UNFAITHFUL ✗
SOURCE PASSAGE
The offence under Section 138 is punishable with imprisonment for a term which may extend to two years, or with fine which may extend to twice the amount of the cheque, or with both.
AI-GENERATED CLAIM
A conviction under Section 138 can result in imprisonment or fine, depending on the court's discretion.
PARTIAL ~
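The three verdicts above can be tallied into a faithfulness breakdown like the one shown earlier on this page. The following is a minimal Python sketch, not the tool's actual export format: the `ClaimJudgment` record, its field names, and the lowercase label strings are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed label vocabulary, mirroring the three verdicts in the interface.
LABELS = ("faithful", "unfaithful", "partial")


@dataclass(frozen=True)
class ClaimJudgment:
    """One annotator verdict on a source passage / generated claim pair (hypothetical schema)."""
    source_passage: str
    claim: str
    label: str  # expected to be one of LABELS


def faithfulness_breakdown(judgments):
    """Return the percentage of claims per label, rounded to one decimal place."""
    if not judgments:
        raise ValueError("no judgments to summarise")
    counts = Counter(j.label for j in judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}


# The three example pairs from the interface, abbreviated.
judgments = [
    ClaimJudgment("Unpaid cheque: drawer deemed to have committed an offence.",
                  "A cheque bounce is a criminal offence by the drawer.", "faithful"),
    ClaimJudgment("Written notice within 30 days of bank information.",
                  "Legal notice within 15 days of the cheque bounce.", "unfaithful"),
    ClaimJudgment("Imprisonment up to two years, fine up to twice the amount, or both.",
                  "Imprisonment or fine, at the court's discretion.", "partial"),
]

breakdown = faithfulness_breakdown(judgments)
print(breakdown)
```

Keeping one record per claim, rather than one per response, is what allows the breakdown to be recomputed by query type or domain later.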
How It Works

Three things the pipeline does on every RAG evaluation project

Three-component RAG decomposition
Retrieval quality, grounding faithfulness, and response accuracy are evaluated independently. A high-scoring retrieval system can still produce unfaithful responses, so the pipeline measures all three separately and you know exactly where failures occur.
Claim-to-source attribution
Every factual claim in the response is traced to a specific passage in the retrieved context. Unsupported claims, hallucinated citations, and faithful summaries are classified at the claim level rather than folded into a single holistic faithfulness score.
Published faithfulness kappa on every batch
Cohen's kappa is calculated separately for faithfulness and factual accuracy, so your QA report shows whether annotators agree on what counts as grounded, not just an aggregate score that masks annotator-level divergence.
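For readers unfamiliar with the statistic, Cohen's kappa compares the agreement two annotators actually reach with the agreement expected by chance given how often each uses each label: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration of that standard formula, not the pipeline's own code; the function name and the sample labels are made up for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative faithfulness labels from two annotators on ten claims.
annotator_1 = ["F", "F", "F", "U", "U", "P", "F", "U", "P", "F"]
annotator_2 = ["F", "F", "U", "U", "U", "P", "F", "U", "F", "F"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.672
```

Running the same function once on the faithfulness labels and once on the factual accuracy labels gives the two separate figures described above.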
Pipeline Capabilities

What the infrastructure delivers

Retrieval Quality Scoring
Each question-context pair receives faithfulness and relevance scores, identifying where the retriever fails before it can corrupt generation quality downstream.
Counterfactual Contrast Pairs
The pipeline generates conflicting context variants so models learn to distinguish grounded answers from plausible-but-wrong confabulations under retrieval pressure.
Pipeline-Integrated Delivery
Datasets delivered in standard formats—JSONL, Parquet, HuggingFace-ready—with schema documentation and loader code included for immediate use.
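As a rough idea of what consuming a JSONL delivery could look like, here is a minimal Python loader sketch. It is not the loader code shipped with the datasets: the field names in `REQUIRED_FIELDS` and the sample record are assumptions, and the delivered schema documentation would be the authority on the real fields.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = {"query", "contexts", "answer", "label"}


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines and checking required fields."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
            records.append(record)
    return records


# Demo: write a one-record file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample_record = {
        "query": "What is the notice period under Section 138?",
        "contexts": ["Written notice within 30 days of bank information."],
        "answer": "Notice must be sent within 15 days.",
        "label": "unfaithful",
    }
    sample_path.write_text(json.dumps(sample_record) + "\n", encoding="utf-8")
    records = load_jsonl(sample_path)

print(records[0]["label"])
```

Failing loudly on a missing field, with the line number, tends to surface schema drift earlier than silently filling defaults.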
What You Get

A complete picture of your RAG system's actual performance

Per-Query Scorecard
Every evaluated query with scores on every metric, the retrieved contexts, the generated answer, human reviewer notes on specific failures, and the automated vs. human score comparison. Structured JSON and formatted PDF.
System Quality Report
Aggregate scores by metric, failure distribution by query type, domain-specific performance breakdown, specific failure examples with explanation, and comparison between automated and human evaluation (showing where automated metrics diverge from ground truth).
Improvement Roadmap
Prioritised recommendations for system improvement: retrieval parameter changes, chunking strategy adjustments, reranking additions, prompt engineering modifications, and knowledge base gaps to fill. Estimated impact on key metrics for each recommendation based on our experience with similar RAG configurations.
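The automated vs. human comparison mentioned in the deliverables above can be pictured as a simple per-query gap check. The Python sketch below is illustrative only: the scorecard layout (`query_id`, `automated`, `human`), the 0 to 1 score scale, and the 0.25 threshold are all assumptions, not the actual report schema.

```python
def flag_divergent(scorecards, metric, threshold=0.25):
    """List queries where automated and human scores differ by at least `threshold`.

    Returns (query_id, gap) tuples, largest absolute gap first. A positive gap
    means the automated metric scored the response higher than the human did.
    """
    flagged = []
    for card in scorecards:
        gap = card["automated"][metric] - card["human"][metric]
        if abs(gap) >= threshold:
            flagged.append((card["query_id"], round(gap, 2)))
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)


# Made-up scorecards on an assumed 0-1 scale.
scorecards = [
    {"query_id": "q1", "automated": {"faithfulness": 0.9}, "human": {"faithfulness": 0.4}},
    {"query_id": "q2", "automated": {"faithfulness": 0.8}, "human": {"faithfulness": 0.75}},
    {"query_id": "q3", "automated": {"faithfulness": 0.3}, "human": {"faithfulness": 0.6}},
]

divergent = flag_divergent(scorecards, "faithfulness")
print(divergent)
```

Queries with a large positive gap are the ones where an automated metric would have passed a response that human reviewers judged ungrounded.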
Pricing

Per-query evaluation pricing

Priced per evaluated query including retrieval quality and generation faithfulness assessment. Human evaluation sample size scales with total query volume. Free 50-query audit with no commitment.

Get 50 Queries Audited Free →
General enterprise RAG: $5–10 / query
Legal / medical / financial RAG: $10–24 / query
Multi-hop / agentic RAG: $12–30 / query
Corrective RLHF pairs (add-on): $7–18 / pair
Continuous monthly evaluation: $2.5K – $10K / month
Free audit: 50 queries / $0

Get 50 RAG queries evaluated free

Share 50 queries from your RAG system, or we can generate representative ones for you. We return faithfulness scores, a retrieval quality assessment, and a 1-page findings report in 5 working days.