Service - Enterprise AI

RAG System Evaluation

Human evaluation of retrieval + generation quality in RAG pipelines. Annotators assess: did the retrieval surface the right context? Did the generation faithfully use it? Detects faithfulness failures, citation errors, and hallucinations where the model ignores its own retrieved sources.

Get 50 Queries Evaluated Free → View Pricing

2-Stage

Retrieval quality and generation faithfulness evaluated separately and scored independently

5 Metrics

Context Precision, Recall, Faithfulness, Answer Relevance, Citation Accuracy

Enterprise

Targeted at enterprise knowledge bases, legal, financial, and HR AI assistants

Free Audit

50 queries evaluated at no cost in 5 working days

✓ FAITHFUL · 68.4%

✗ UNFAITHFUL · 18.2%

~ PARTIAL · 13.4%

● IAA κ: 0.86

▼ RAG FAITHFULNESS BREAKDOWN

✓ 3,200 SOURCE-CLAIM PAIRS

RAG Evaluation - What It Is

Does your RAG system actually use the context?

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base. When it works, it grounds the model's responses in verified, up-to-date information. When it fails, it fails in one of two distinct ways: it either retrieves the wrong context and answers with confidence, producing responses that users trust because they appear well-sourced, or it does not retrieve anything relevant. We assess whether retrieval surfaces the right context and whether generation faithfully uses it, detecting faithfulness failures before they reach your enterprise users.

Get a Free Audit →

Live Annotation Interface

RAG Faithfulness Evaluation Tool

Annotators compare retrieved source passages against AI-generated claims, flagging unsupported or contradicted statements to improve RAG system grounding and citation accuracy.

ConcaveLabel Studio RAG Eval · System: LegalAssist RAG · 3,200 response-source pairs

SOURCE PASSAGE

Section 138 of the Negotiable Instruments Act, 1881 provides that where a cheque drawn by a person for discharge of any liability is returned by the bank unpaid due to insufficient funds, the drawer shall be deemed to have committed an offence.

AI-GENERATED CLAIM

Under Section 138 of the NI Act, a cheque bounce due to insufficient funds constitutes a criminal offence by the drawer.

FAITHFUL ✓

SOURCE PASSAGE

The complainant must give written notice to the drawer within 30 days of receiving information from the bank regarding the return of the cheque as unpaid.

AI-GENERATED CLAIM

The complainant is required to send a legal notice within 15 days of the cheque bounce to initiate Section 138 proceedings.

UNFAITHFUL ✗

SOURCE PASSAGE

The offence under Section 138 is punishable with imprisonment for a term which may extend to two years, or with fine which may extend to twice the amount of the cheque, or with both.

AI-GENERATED CLAIM

A conviction under Section 138 can result in imprisonment or fine, depending on the court's discretion.

PARTIAL ~

How It Works

Three things the pipeline does on every RAG evaluation project

Three-component RAG decomposition

Retrieval quality, grounding faithfulness, and response accuracy are evaluated independently. A high-scoring retrieval system can still produce unfaithful responses, so the pipeline measures all three separately and you know exactly where failures occur.

Claim-to-source attribution

Every factual claim in the response traced to a specific passage in the retrieved context. Unsupported claims, hallucinated citations, and faithful summaries classified at the claim level, not as a holistic faithfulness score.
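As a rough illustration of claim-level classification, the sketch below rolls individual claim labels up into a percentage breakdown like the one shown in the faithfulness chart. The `ClaimAnnotation` record, its field names, and the `faithfulness_breakdown` helper are hypothetical and are not the delivered schema; only the three label names come from the annotation interface above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Label names mirror the FAITHFUL / UNFAITHFUL / PARTIAL tags in the annotation
# interface; the record layout itself is an assumption made for illustration.
LABELS = ("FAITHFUL", "UNFAITHFUL", "PARTIAL")


@dataclass(frozen=True)
class ClaimAnnotation:
    query_id: str
    claim: str
    source_passage_id: Optional[str]  # None when no passage supports the claim
    label: str


def faithfulness_breakdown(annotations):
    """Return the share of claims per label as percentages rounded to 0.1."""
    if not annotations:
        return {label: 0.0 for label in LABELS}
    counts = Counter(a.label for a in annotations)
    total = len(annotations)
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}


annotations = [
    ClaimAnnotation("q1", "Cheque bounce is an offence by the drawer", "p1", "FAITHFUL"),
    ClaimAnnotation("q1", "Notice must be sent within 15 days", "p2", "UNFAITHFUL"),
    ClaimAnnotation("q1", "Conviction can result in imprisonment or fine", "p3", "PARTIAL"),
    ClaimAnnotation("q2", "Section 138 applies to unpaid cheques", "p1", "FAITHFUL"),
]
print(faithfulness_breakdown(annotations))
# {'FAITHFUL': 50.0, 'UNFAITHFUL': 25.0, 'PARTIAL': 25.0}
```

Keeping one record per claim, rather than one score per response, is what lets a single answer contribute both faithful and unfaithful claims to the breakdown.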

Published faithfulness kappa on every batch

Cohen's kappa is calculated separately for faithfulness and factual accuracy, so your QA report shows whether annotators agree on what counts as grounded rather than an aggregate score that masks annotator-level divergence.
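For readers unfamiliar with the statistic, here is a minimal sketch of Cohen's kappa for two annotators labelling the same claims. It uses the standard definition (observed agreement corrected for chance agreement); the function name and the sample labels are illustrative assumptions, not output from the pipeline.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness labels from two annotators on six claims.
annotator_1 = ["FAITHFUL", "FAITHFUL", "UNFAITHFUL", "PARTIAL", "FAITHFUL", "UNFAITHFUL"]
annotator_2 = ["FAITHFUL", "FAITHFUL", "UNFAITHFUL", "FAITHFUL", "FAITHFUL", "PARTIAL"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Running the same function once on faithfulness labels and once on factual-accuracy labels gives the two separate agreement figures described above.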

Pipeline Capabilities

What the infrastructure delivers

Retrieval Quality Scoring

Each question-context pair receives faithfulness and relevance scores, identifying where the retriever fails before it can corrupt generation quality downstream.
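One simple way to score the retriever on its own is set-based precision and recall over passage IDs, comparing what was retrieved with what annotators marked as relevant. The sketch below assumes that setup; the function name and passage IDs are hypothetical, and rank-aware variants of context precision exist that this does not implement.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved passages against
    annotator-marked relevant passages."""
    retrieved = list(dict.fromkeys(retrieved_ids))  # drop duplicates, keep order
    relevant = set(relevant_ids)
    hits = sum(1 for pid in retrieved if pid in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four passages retrieved, three judged relevant by annotators.
precision, recall = context_precision_recall(
    retrieved_ids=["p1", "p7", "p3", "p9"],
    relevant_ids=["p1", "p3", "p5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.67
```

Low precision with high recall suggests the retriever is pulling in noise; the reverse suggests relevant passages are being missed.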

Counterfactual Contrast Pairs

The pipeline generates conflicting context variants so models learn to distinguish grounded answers from plausible-but-wrong confabulations under retrieval pressure.

Pipeline-Integrated Delivery

Datasets delivered in standard formats—JSONL, Parquet, HuggingFace-ready—with schema documentation and loader code included for immediate use.
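To give a sense of what working with a JSONL delivery might look like, the sketch below reads one JSON record per line and filters for unfaithful claims. The field names (`query_id`, `claim`, `label`) are assumptions for illustration only; the actual schema is defined by the documentation shipped with each dataset.

```python
import io
import json


def load_jsonl(stream):
    """Parse a JSONL stream (one JSON object per line), skipping blank lines."""
    records = []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc
    return records


# An in-memory stand-in for a delivered file; field names are hypothetical.
sample = io.StringIO(
    '{"query_id": "q1", "claim": "Cheque bounce is an offence", "label": "FAITHFUL"}\n'
    '{"query_id": "q1", "claim": "Notice within 15 days", "label": "UNFAITHFUL"}\n'
)
records = load_jsonl(sample)
unfaithful = [r for r in records if r["label"] == "UNFAITHFUL"]
print(len(records), len(unfaithful))
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8")` in place of the `StringIO` object.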

What You Get

A complete picture of your RAG system's actual performance

Per-Query Scorecard

Every evaluated query with scores on all 5 metrics, the retrieved contexts, the generated answer, human reviewer notes on specific failures, and the automated vs. human score comparison. Delivered as structured JSON and formatted PDF.

System Quality Report

Aggregate scores by metric, failure distribution by query type, domain-specific performance breakdown, specific failure examples with explanation, and comparison between automated and human evaluation (showing where automated metrics diverge from ground truth).

Improvement Roadmap

Prioritised recommendations for system improvement: retrieval parameter changes, chunking strategy adjustments, reranking additions, prompt engineering modifications, and knowledge base gaps to fill. Estimated impact on key metrics for each recommendation based on our experience with similar RAG configurations.

Pricing

Per-query evaluation pricing

Priced per evaluated query including retrieval quality and generation faithfulness assessment. Human evaluation sample size scales with total query volume. Free 50-query audit with no commitment.

Get 50 Queries Audited Free →

General enterprise RAG: $5–10 / query

Legal / medical / financial RAG: $10–24 / query

Multi-hop / agentic RAG: $12–30 / query

Corrective RLHF pairs (add-on): $7–18 / pair

Continuous monthly evaluation: $2.5K–$10K / month

Free audit: 50 queries / $0

Solutions that complement RAG evaluation

Get 50 RAG queries evaluated free

Share 50 queries from your RAG system, or we can generate representative ones. We return faithfulness scores, a retrieval quality assessment, and a 1-page findings report in 5 working days.

Start Free RAG Audit → Talk to our ML team
