RAG Evaluation Guide
Learn how to evaluate Retrieval-Augmented Generation systems for accuracy and relevance.
What is RAG?
Retrieval-Augmented Generation (RAG) combines information retrieval with LLM generation. Instead of relying solely on the model's training data, RAG systems retrieve relevant documents from a knowledge base and use them as context for generating responses.
Why RAG Evaluation is Challenging
RAG systems have multiple failure points:
- Retrieval: Did we find the right documents?
- Relevance: Is the retrieved context useful?
- Generation: Did the LLM use the context correctly?
- Grounding: Is the answer supported by retrieved docs?
Each component must be evaluated both in isolation and end-to-end.
Evaluation Framework
1. Retrieval Quality
Are you retrieving the right documents?
Metrics:
- Precision@K: Of the top K retrieved docs, how many are relevant?
- Recall@K: Of all relevant docs, how many are in the top K?
- MRR (Mean Reciprocal Rank): Position of first relevant document
- NDCG (Normalized Discounted Cumulative Gain): Quality of ranking
Evaluation Method:
Test Case:
Query: "What is our refund policy for damaged items?"
Gold Standard: [doc_42, doc_87, doc_103]
Retrieved: [doc_42, doc_91, doc_103, doc_45, doc_87]
Precision@3: 2/3 = 67% | Recall@5: 3/3 = 100%
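These metrics fall out directly from the gold-standard and retrieved lists; a minimal sketch in Python, using the test case above (doc IDs are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return sum(1 for d in relevant if d in retrieved[:k]) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

# The test case above
relevant = {"doc_42", "doc_87", "doc_103"}
retrieved = ["doc_42", "doc_91", "doc_103", "doc_45", "doc_87"]
print(precision_at_k(retrieved, relevant, 3))  # 2/3
print(recall_at_k(retrieved, relevant, 5))     # 1.0
```

In practice these are averaged over the whole test suite, not a single query.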
2. Context Relevance
Is the retrieved context actually useful for answering the query?
Metrics:
- Context Relevance Score: LLM judges if context helps answer the query (1-5)
- Context Precision: % of retrieved chunks that are relevant
LLM Judge Prompt:
"Given this query: [QUERY] and retrieved context: [CONTEXT], rate how relevant this context is for answering the query on a scale of 1-5. A score of 5 means the context directly answers the query. A score of 1 means the context is completely irrelevant."
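A thin wrapper around that prompt might look like the sketch below; `llm` stands for any chat-completion callable you supply (an assumption, not a specific API), and the score is parsed naively from the reply:

```python
import re

JUDGE_PROMPT = (
    "Given this query: {query} and retrieved context: {context}, "
    "rate how relevant this context is for answering the query on a "
    "scale of 1-5. A score of 5 means the context directly answers "
    "the query. A score of 1 means the context is completely irrelevant."
)

def context_relevance(query, context, llm):
    """Score context relevance 1-5 via an LLM judge. `llm` is any
    callable taking a prompt string and returning the reply text."""
    reply = llm(JUDGE_PROMPT.format(query=query, context=context))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no 1-5 score in judge reply: {reply!r}")
    return int(match.group())
```

Asking the judge to reply with only the digit (or with structured output) makes the parsing far more robust than this regex.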
3. Answer Faithfulness
Is the generated answer grounded in the retrieved context, or is it hallucinating?
Metrics:
- Faithfulness Score: % of claims in answer supported by context
- Citation Coverage: Are all facts attributed to sources?
Evaluation Approach:
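Split the answer into claims and check each one against the context. Production pipelines use an NLI model or an LLM judge per claim; the sketch below substitutes a crude content-word overlap purely to show the shape of the computation:

```python
import re

def faithfulness_score(answer, context, threshold=0.8):
    """Share of answer sentences whose content words mostly appear in
    the context. Crude stand-in for a per-claim NLI/LLM-judge check."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    supported = 0
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
                 if len(w) > 3]  # skip short function words
        if not words:
            supported += 1
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences) if sentences else 0.0
```

Lexical overlap misses paraphrase and negation, which is exactly why real faithfulness checks are done with a model rather than word matching.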
4. Answer Relevance
Does the answer actually address the user's query?
Example:
Query: "How long does shipping take?"
Bad (not relevant): "We offer free shipping on orders over $50."
Good: "Standard shipping takes 5-7 business days."
5. Answer Correctness
Is the answer factually accurate compared to ground truth?
Methods:
- Exact match: For factual queries (dates, numbers)
- Semantic similarity: Compare to reference answer
- LLM-as-judge: Evaluate correctness holistically
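For the exact-match case, normalizing casing and punctuation first avoids spurious mismatches; a minimal sketch:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def exact_match(predicted, reference):
    """True if the two answers are identical after normalization."""
    return normalize(predicted) == normalize(reference)
```

Semantic similarity and LLM-as-judge cover the (far more common) cases where correct answers are worded differently from the reference.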
Building a RAG Test Suite
Step 1: Create Query-Answer Pairs
Collect 100-200 representative queries with gold-standard answers:
Example Test Case:
Query: "What is the maximum file size for uploads?"
Expected Answer: "The maximum file size is 100MB per file"
Relevant Docs: ["docs/upload-limits.md", "docs/faq.md"]
Category: Technical specs
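A test case like this maps naturally onto a small record type; the field names below are a suggestion, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class RAGTestCase:
    query: str
    expected_answer: str
    relevant_docs: list
    category: str

case = RAGTestCase(
    query="What is the maximum file size for uploads?",
    expected_answer="The maximum file size is 100MB per file",
    relevant_docs=["docs/upload-limits.md", "docs/faq.md"],
    category="Technical specs",
)
```

Storing the suite as a list of these (or the equivalent JSONL file) makes every evaluation step below a simple loop.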
Step 2: Test Retrieval Separately
Before evaluating end-to-end, isolate retrieval:
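A retrieval-only harness just runs the retriever over the suite and averages a metric such as recall@k; `retrieve` below is a placeholder for your own retriever:

```python
def evaluate_retrieval(test_cases, retrieve, k=5):
    """Average recall@k of `retrieve` (query -> ranked doc ids)
    over a suite of {"query", "relevant_docs"} test cases."""
    total = 0.0
    for case in test_cases:
        retrieved = retrieve(case["query"])[:k]
        hits = sum(1 for d in case["relevant_docs"] if d in retrieved)
        total += hits / len(case["relevant_docs"])
    return total / len(test_cases)

# Tiny illustration with a stubbed retriever
cases = [
    {"query": "q1", "relevant_docs": ["a", "b"]},
    {"query": "q2", "relevant_docs": ["c"]},
]
stub = lambda q: ["a", "x", "b"] if q == "q1" else ["y", "z"]
avg_recall = evaluate_retrieval(cases, stub, k=3)
```

If recall@k is low here, no amount of prompt engineering downstream will fix the answers.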
Step 3: Evaluate End-to-End
Run full RAG pipeline and check all quality dimensions:
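End-to-end, the same loop runs the whole pipeline and scores every dimension at once; `rag_pipeline` and the judge callables are placeholders for your own components:

```python
def evaluate_end_to_end(test_cases, rag_pipeline, judges):
    """`rag_pipeline(query)` -> (answer, retrieved_docs);
    `judges` maps a dimension name to callable(case, answer, docs) -> score."""
    results = []
    for case in test_cases:
        answer, docs = rag_pipeline(case["query"])
        row = {"query": case["query"]}
        for name, judge in judges.items():
            row[name] = judge(case, answer, docs)
        results.append(row)
    return results

# Tiny illustration with stubbed components
demo = evaluate_end_to_end(
    [{"query": "q", "expected_answer": "a"}],
    lambda q: ("a", ["doc_1"]),
    {"correctness": lambda case, ans, docs: float(ans == case["expected_answer"])},
)
```

Keeping one judge callable per dimension (faithfulness, relevance, correctness) lets you report them separately instead of one blended score.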
Common Failure Modes
1. Retrieval Failures
Problem: Query uses different terminology than documents
Query: "How do I reset my password?"
Documents use: "password recovery" not "reset"
Solution: Query expansion, synonyms, hybrid search (keyword + semantic)
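Query expansion can be as simple as appending known synonyms before search; the synonym table below is illustrative, not a real resource:

```python
# Hypothetical domain synonym table
SYNONYMS = {"reset": ["recovery", "recover"], "upload": ["attach"]}

def expand_query(query, synonyms=SYNONYMS):
    """Append synonyms of any query term so keyword search also
    hits documents that use different terminology."""
    terms = [t.strip("?.,!").lower() for t in query.split()]
    extra = [s for t in terms for s in synonyms.get(t, [])]
    return " ".join(terms + extra)
```

An LLM can generate these expansions on the fly, at the cost of extra latency per query.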
2. Context Window Overflow
Retrieved too many docs and exceeded the LLM's context limit.
Solution: Rerank and truncate to most relevant chunks.
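The truncation step can be sketched as follows; word count stands in for a real tokenizer, and `max_tokens` is an illustrative budget:

```python
def fit_context(chunks, scores, max_tokens=3000):
    """Keep the highest-scoring chunks until a rough token budget
    is hit. Swap the word-count approximation for your model's
    tokenizer to respect real context limits."""
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    kept, used = [], 0
    for chunk, _ in ranked:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Dropping the lowest-scoring chunks beats blind tail truncation, which can cut the one chunk that actually answers the query.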
3. Answer Hallucination
Context: "Our support team responds within 24 hours on weekdays."
Bad Answer: "Support responds within 24 hours, including weekends."
Issue: LLM added information not in context
Solution: Add instruction: "Only use information from the provided context. If unsure, say so."
4. Incomplete Answers
Relevant information was retrieved but not included in answer.
Solution: Improve generation prompt to cover all relevant points from context.
Advanced Techniques
1. Hybrid Search
Combine semantic search with keyword matching:
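One simple combination is a weighted sum of per-document scores from the two searches; both inputs are assumed already normalized to [0, 1] (e.g. min-max over the candidate set), and alpha tunes the balance:

```python
def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """Blend two per-document score dicts; alpha weights the
    semantic side. Missing documents default to a score of 0."""
    docs = set(keyword_scores) | set(semantic_scores)
    return {
        d: alpha * semantic_scores.get(d, 0.0)
           + (1 - alpha) * keyword_scores.get(d, 0.0)
        for d in docs
    }
```

Reciprocal-rank fusion is a common alternative that blends ranks instead of raw scores, avoiding the normalization step entirely.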
2. Reranking
After initial retrieval, use a reranker to improve ordering:
- Cross-encoder models (BERT-based)
- LLM-as-reranker
- Learning-to-rank models
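Whichever scoring model you pick, reranking reduces to the same wrapper; `score` here is a stand-in for a cross-encoder forward pass or an LLM call:

```python
def rerank(query, docs, score, top_n=5):
    """Reorder retrieval candidates by score(query, doc) descending
    and keep the top_n for the generation step."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_n]
```

Because the scorer sees the query and document together, it can correct ordering mistakes that a bi-encoder retriever makes.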
3. Query Rewriting
Transform user queries into better search queries:
Original: "How do I do that thing with the files?"
Rewritten: "How to upload files? What is the file size limit?"
4. Multi-Hop Retrieval
For complex queries, retrieve iteratively:
- Retrieve documents answering first part of query
- Use those results to refine query for second retrieval
- Combine information from both retrievals
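The loop above can be sketched as follows; `retrieve` and `refine` (e.g. an LLM that rewrites the query given results so far) are placeholders for your own components:

```python
def multi_hop_retrieve(query, retrieve, refine, hops=2):
    """Iterative retrieval: after each hop, `refine` produces the
    next query from the results so far. Returns deduplicated docs
    accumulated across all hops, in first-seen order."""
    seen, docs = set(), []
    q = query
    for _ in range(hops):
        for d in retrieve(q):
            if d not in seen:
                seen.add(d)
                docs.append(d)
        q = refine(q, docs)
    return docs
```

A useful refinement is to stop early when a hop adds no new documents.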
Monitoring Production RAG
Track these metrics continuously:
- Answer rate: % of queries answered vs. "I don't know"
- User feedback: Thumbs up/down on answers
- Retrieval latency: Time to fetch documents
- Generation latency: Time to generate answer
- Context usage: Are retrieved docs actually being used?
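A rolling window over recent requests is often enough to track these; the class and field names below are illustrative:

```python
from collections import deque

class RAGMonitor:
    """Keep the last `window` requests and derive metrics on demand."""
    def __init__(self, window=1000):
        self.events = deque(maxlen=window)

    def record(self, answered, retrieval_ms=0.0, generation_ms=0.0):
        self.events.append({"answered": answered,
                            "retrieval_ms": retrieval_ms,
                            "generation_ms": generation_ms})

    def answer_rate(self):
        """Fraction of recent queries answered vs. "I don't know"."""
        if not self.events:
            return 0.0
        return sum(e["answered"] for e in self.events) / len(self.events)
```

The same pattern extends to latency percentiles and thumbs-up/down feedback counts.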
Optimization Strategies
Improve Retrieval
- Fine-tune embedding models on your domain
- Increase chunk overlap to avoid splitting related information
- Add metadata filters (date, category, author)
- Use better chunking strategies (semantic, sentence-based)
Improve Generation
- Provide clear instructions about using context
- Include examples of good citations
- Use chain-of-thought to explain reasoning
- Add explicit "Don't hallucinate" instructions
Real-World Example
Technical Documentation Q&A
Knowledge Base: 500 documents, 10,000 chunks
Test Suite: 150 queries
Initial Performance:
- Precision@3: 58%
- Faithfulness: 82%
- Answer correctness: 3.2/5
After Optimization:
- Added hybrid search + reranking
- Fine-tuned retriever on domain data
- Improved generation prompts with examples
Final Performance:
- Precision@3: 84% (+26pp)
- Faithfulness: 96% (+14pp)
- Answer correctness: 4.3/5 (+1.1 points)