RAG Evaluation Guide
Learn how to evaluate Retrieval-Augmented Generation systems for accuracy and relevance.
What is RAG?
Retrieval-Augmented Generation (RAG) combines information retrieval with LLM generation. Instead of relying solely on the model's training data, RAG systems retrieve relevant documents from a knowledge base and use them as context for generating responses.
Why RAG Evaluation is Challenging
RAG systems have multiple failure points:
- Retrieval: Did we find the right documents?
- Relevance: Is the retrieved context useful?
- Generation: Did the LLM use the context correctly?
- Grounding: Is the answer supported by retrieved docs?
Each component must be evaluated both in isolation and end-to-end.
Evaluation Framework
1. Retrieval Quality
Are you retrieving the right documents?
Metrics:
- Precision@K: Of the top K retrieved docs, how many are relevant?
- Recall@K: Of all relevant docs, how many are in the top K?
- MRR (Mean Reciprocal Rank): Position of first relevant document
- NDCG (Normalized Discounted Cumulative Gain): Quality of ranking
Evaluation Method:
Test Case:
Query: "What is our refund policy for damaged items?"
Gold Standard: [doc_42, doc_87, doc_103]
Retrieved: [doc_42, doc_91, doc_103, doc_45, doc_87]
Precision@3: 2/3 = 67% | Recall@5: 3/3 = 100%
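These metrics fall out directly from the gold-standard and retrieved lists; a minimal sketch in Python, using the test case above (doc IDs are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return sum(1 for d in relevant if d in retrieved[:k]) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

# The test case above
relevant = {"doc_42", "doc_87", "doc_103"}
retrieved = ["doc_42", "doc_91", "doc_103", "doc_45", "doc_87"]
print(precision_at_k(retrieved, relevant, 3))  # 2/3
print(recall_at_k(retrieved, relevant, 5))     # 1.0
```

In practice these are averaged over the whole test suite, not a single query.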
2. Context Relevance
Is the retrieved context actually useful for answering the query?
Metrics:
- Context Relevance Score: LLM judges if context helps answer the query (1-5)
- Context Precision: % of retrieved chunks that are relevant
LLM Judge Prompt:
"Given this query: [QUERY] and retrieved context: [CONTEXT], rate how relevant this context is for answering the query on a scale of 1-5. A score of 5 means the context directly answers the query. A score of 1 means the context is completely irrelevant."
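A thin wrapper around that prompt might look like the sketch below; `llm` stands for any chat-completion callable you supply (an assumption, not a specific API), and the score is parsed naively from the reply:

```python
import re

JUDGE_PROMPT = (
    "Given this query: {query} and retrieved context: {context}, "
    "rate how relevant this context is for answering the query on a "
    "scale of 1-5. A score of 5 means the context directly answers "
    "the query. A score of 1 means the context is completely irrelevant."
)

def context_relevance(query, context, llm):
    """Score context relevance 1-5 via an LLM judge. `llm` is any
    callable taking a prompt string and returning the reply text."""
    reply = llm(JUDGE_PROMPT.format(query=query, context=context))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no 1-5 score in judge reply: {reply!r}")
    return int(match.group())
```

Asking the judge to reply with only the digit (or with structured output) makes the parsing far more robust than this regex.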
3. Answer Faithfulness
Is the generated answer grounded in the retrieved context, or is it hallucinating?
Metrics:
- Faithfulness Score: % of claims in answer supported by context
- Citation Coverage: Are all facts attributed to sources?
Evaluation Approach:
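Split the answer into claims and check each one against the context. Production pipelines use an NLI model or an LLM judge per claim; the sketch below substitutes a crude content-word overlap purely to show the shape of the computation:

```python
import re

def faithfulness_score(answer, context, threshold=0.8):
    """Share of answer sentences whose content words mostly appear in
    the context. Crude stand-in for a per-claim NLI/LLM-judge check."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    supported = 0
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
                 if len(w) > 3]  # skip short function words
        if not words:
            supported += 1
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences) if sentences else 0.0
```

Lexical overlap misses paraphrase and negation, which is exactly why real faithfulness checks are done with a model rather than word matching.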
4. Answer Relevance
Does the answer actually address the user's query?
Example:
Query: "How long does shipping take?"
Bad (not relevant): "We offer free shipping on orders over $50."
Good: "Standard shipping takes 5-7 business days."
5. Answer Correctness
Is the answer factually accurate compared to ground truth?
Methods:
- Exact match: For factual queries (dates, numbers)
- Semantic similarity: Compare to reference answer
- LLM-as-judge: Evaluate correctness holistically
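For the exact-match case, normalizing casing and punctuation first avoids spurious mismatches; a minimal sketch:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def exact_match(predicted, reference):
    """True if the two answers are identical after normalization."""
    return normalize(predicted) == normalize(reference)
```

Semantic similarity and LLM-as-judge cover the (far more common) cases where correct answers are worded differently from the reference.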
Building a RAG Test Suite
Step 1: Create Query-Answer Pairs
Collect 100-200 representative queries with gold-standard answers:
Example Test Case:
Query: "What is the maximum file size for uploads?"
Expected Answer: "The maximum file size is 100MB per file"
Relevant Docs: ["docs/upload-limits.md", "docs/faq.md"]
Category: Technical specs
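A test case like this maps naturally onto a small record type; the field names below are a suggestion, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class RAGTestCase:
    query: str
    expected_answer: str
    relevant_docs: list
    category: str

case = RAGTestCase(
    query="What is the maximum file size for uploads?",
    expected_answer="The maximum file size is 100MB per file",
    relevant_docs=["docs/upload-limits.md", "docs/faq.md"],
    category="Technical specs",
)
```

Storing the suite as a list of these (or the equivalent JSONL file) makes every evaluation step below a simple loop.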
Step 2: Test Retrieval Separately
Before evaluating end-to-end, isolate retrieval:
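A retrieval-only harness just runs the retriever over the suite and averages a metric such as recall@k; `retrieve` below is a placeholder for your own retriever:

```python
def evaluate_retrieval(test_cases, retrieve, k=5):
    """Average recall@k of `retrieve` (query -> ranked doc ids)
    over a suite of {"query", "relevant_docs"} test cases."""
    total = 0.0
    for case in test_cases:
        retrieved = retrieve(case["query"])[:k]
        hits = sum(1 for d in case["relevant_docs"] if d in retrieved)
        total += hits / len(case["relevant_docs"])
    return total / len(test_cases)

# Tiny illustration with a stubbed retriever
cases = [
    {"query": "q1", "relevant_docs": ["a", "b"]},
    {"query": "q2", "relevant_docs": ["c"]},
]
stub = lambda q: ["a", "x", "b"] if q == "q1" else ["y", "z"]
avg_recall = evaluate_retrieval(cases, stub, k=3)
```

If recall@k is low here, no amount of prompt engineering downstream will fix the answers.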
Step 3: Evaluate End-to-End
Run full RAG pipeline and check all quality dimensions:
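End-to-end, the same loop runs the whole pipeline and scores every dimension at once; `rag_pipeline` and the judge callables are placeholders for your own components:

```python
def evaluate_end_to_end(test_cases, rag_pipeline, judges):
    """`rag_pipeline(query)` -> (answer, retrieved_docs);
    `judges` maps a dimension name to callable(case, answer, docs) -> score."""
    results = []
    for case in test_cases:
        answer, docs = rag_pipeline(case["query"])
        row = {"query": case["query"]}
        for name, judge in judges.items():
            row[name] = judge(case, answer, docs)
        results.append(row)
    return results

# Tiny illustration with stubbed components
demo = evaluate_end_to_end(
    [{"query": "q", "expected_answer": "a"}],
    lambda q: ("a", ["doc_1"]),
    {"correctness": lambda case, ans, docs: float(ans == case["expected_answer"])},
)
```

Keeping one judge callable per dimension (faithfulness, relevance, correctness) lets you report them separately instead of one blended score.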
Common Failure Modes
1. Retrieval Failures
Problem: Query uses different terminology than documents
Query: "How do I reset my password?"
Documents use: "password recovery" not "reset"
Solution: Query expansion, synonyms, hybrid search (keyword + semantic)
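Query expansion can be as simple as appending known synonyms before search; the synonym table below is illustrative, not a real resource:

```python
# Hypothetical domain synonym table
SYNONYMS = {"reset": ["recovery", "recover"], "upload": ["attach"]}

def expand_query(query, synonyms=SYNONYMS):
    """Append synonyms of any query term so keyword search also
    hits documents that use different terminology."""
    terms = [t.strip("?.,!").lower() for t in query.split()]
    extra = [s for t in terms for s in synonyms.get(t, [])]
    return " ".join(terms + extra)
```

An LLM can generate these expansions on the fly, at the cost of extra latency per query.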
2. Context Window Overflow
Retrieved too many docs and exceeded the LLM's context limit.
Solution: Rerank and truncate to most relevant chunks.
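The truncation step can be sketched as follows; word count stands in for a real tokenizer, and `max_tokens` is an illustrative budget:

```python
def fit_context(chunks, scores, max_tokens=3000):
    """Keep the highest-scoring chunks until a rough token budget
    is hit. Swap the word-count approximation for your model's
    tokenizer to respect real context limits."""
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    kept, used = [], 0
    for chunk, _ in ranked:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Dropping the lowest-scoring chunks beats blind tail truncation, which can cut the one chunk that actually answers the query.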
3. Answer Hallucination
Context: "Our support team responds within 24 hours on weekdays."
Bad Answer: "Support responds within 24 hours, including weekends."
Issue: LLM added information not in context
Solution: Add instruction: "Only use information from the provided context. If unsure, say so."
4. Incomplete Answers
Relevant information was retrieved but not included in answer.
Solution: Improve generation prompt to cover all relevant points from context.
Advanced Techniques
1. Hybrid Search
Combine semantic search with keyword matching:
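One simple combination is a weighted sum of per-document scores from the two searches; both inputs are assumed already normalized to [0, 1] (e.g. min-max over the candidate set), and alpha tunes the balance:

```python
def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """Blend two per-document score dicts; alpha weights the
    semantic side. Missing documents default to a score of 0."""
    docs = set(keyword_scores) | set(semantic_scores)
    return {
        d: alpha * semantic_scores.get(d, 0.0)
           + (1 - alpha) * keyword_scores.get(d, 0.0)
        for d in docs
    }
```

Reciprocal-rank fusion is a common alternative that blends ranks instead of raw scores, avoiding the normalization step entirely.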
2. Reranking
After initial retrieval, use a reranker to improve ordering:
- Cross-encoder models (BERT-based)
- LLM-as-reranker
- Learning-to-rank models
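Whichever scoring model you pick, reranking reduces to the same wrapper; `score` here is a stand-in for a cross-encoder forward pass or an LLM call:

```python
def rerank(query, docs, score, top_n=5):
    """Reorder retrieval candidates by score(query, doc) descending
    and keep the top_n for the generation step."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_n]
```

Because the scorer sees the query and document together, it can correct ordering mistakes that a bi-encoder retriever makes.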
3. Query Rewriting
Transform user queries into better search queries:
Original: "How do I do that thing with the files?"
Rewritten: "How to upload files? What is the file size limit?"
4. Multi-Hop Retrieval
For complex queries, retrieve iteratively:
- Retrieve documents answering first part of query
- Use those results to refine query for second retrieval
- Combine information from both retrievals
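The loop above can be sketched as follows; `retrieve` and `refine` (e.g. an LLM that rewrites the query given results so far) are placeholders for your own components:

```python
def multi_hop_retrieve(query, retrieve, refine, hops=2):
    """Iterative retrieval: after each hop, `refine` produces the
    next query from the results so far. Returns deduplicated docs
    accumulated across all hops, in first-seen order."""
    seen, docs = set(), []
    q = query
    for _ in range(hops):
        for d in retrieve(q):
            if d not in seen:
                seen.add(d)
                docs.append(d)
        q = refine(q, docs)
    return docs
```

A useful refinement is to stop early when a hop adds no new documents.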
Monitoring Production RAG
Track these metrics continuously:
- Answer rate: % of queries answered vs. "I don't know"
- User feedback: Thumbs up/down on answers
- Retrieval latency: Time to fetch documents
- Generation latency: Time to generate answer
- Context usage: Are retrieved docs actually being used?
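A rolling window over recent requests is often enough to track these; the class and field names below are illustrative:

```python
from collections import deque

class RAGMonitor:
    """Keep the last `window` requests and derive metrics on demand."""
    def __init__(self, window=1000):
        self.events = deque(maxlen=window)

    def record(self, answered, retrieval_ms=0.0, generation_ms=0.0):
        self.events.append({"answered": answered,
                            "retrieval_ms": retrieval_ms,
                            "generation_ms": generation_ms})

    def answer_rate(self):
        """Fraction of recent queries answered vs. "I don't know"."""
        if not self.events:
            return 0.0
        return sum(e["answered"] for e in self.events) / len(self.events)
```

The same pattern extends to latency percentiles and thumbs-up/down feedback counts.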
Optimization Strategies
Improve Retrieval
- Fine-tune embedding models on your domain
- Increase chunk overlap to avoid splitting related information
- Add metadata filters (date, category, author)
- Use better chunking strategies (semantic, sentence-based)
Improve Generation
- Provide clear instructions about using context
- Include examples of good citations
- Use chain-of-thought to explain reasoning
- Add explicit "Don't hallucinate" instructions
Real-World Example
Technical Documentation Q&A
Knowledge Base: 500 documents, 10,000 chunks
Test Suite: 150 queries
Initial Performance:
- Precision@3: 58%
- Faithfulness: 82%
- Answer correctness: 3.2/5
After Optimization:
- Added hybrid search + reranking
- Fine-tuned retriever on domain data
- Improved generation prompts with examples
Final Performance:
- Precision@3: 84% (+26pp)
- Faithfulness: 96% (+14pp)
- Answer correctness: 4.3/5 (+1.1 points)