Appendix C: Benchmarking Your RAG System
This appendix provides a comprehensive guide to benchmarking RAG systems. Use this to establish baselines, compare approaches, and measure improvements systematically.
Why Benchmark?
For Product Managers
Benchmarking answers critical business questions:
- How does our system compare to alternatives? Justify build vs buy decisions
- Are we improving? Track progress over time
- Where should we invest? Identify the weakest components
- What is the ROI of changes? Quantify improvement value
For Engineers
Benchmarking provides technical clarity:
- Reproducible comparisons: Eliminate confounding variables
- Component isolation: Test retrieval separate from generation
- Regression detection: Catch degradation before production
- Architecture decisions: Data-driven technology choices
Standard Datasets
BEIR (Benchmarking IR)
BEIR is a heterogeneous benchmark covering 18 datasets across diverse domains; the table below lists a representative subset.
| Dataset | Domain | Queries | Documents | Task |
|---|---|---|---|---|
| MS MARCO | Web | 6,980 | 8.8M | Passage retrieval |
| TREC-COVID | Biomedical | 50 | 171K | Scientific search |
| NFCorpus | Nutrition | 323 | 3.6K | Expert search |
| NQ | Wikipedia | 3,452 | 2.7M | Question answering |
| HotpotQA | Wikipedia | 7,405 | 5.2M | Multi-hop QA |
| FiQA | Finance | 648 | 57K | Financial QA |
| ArguAna | Arguments | 1,406 | 8.7K | Argument retrieval |
| Touche-2020 | Arguments | 49 | 382K | Argument search |
| CQADupStack | StackExchange | 13,145 | 457K | Duplicate detection |
| Quora | Social | 10,000 | 523K | Duplicate detection |
| DBPedia | Wikipedia | 400 | 4.6M | Entity search |
| SCIDOCS | Scientific | 1,000 | 25K | Citation prediction |
| FEVER | Wikipedia | 6,666 | 5.4M | Fact verification |
| Climate-FEVER | Wikipedia | 1,535 | 5.4M | Climate claims |
| SciFact | Scientific | 300 | 5K | Scientific claims |
When to use BEIR: Evaluating general-purpose embedding models, comparing retrieval approaches across domains.
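For reference, a BEIR dataset can be loaded in a few lines. The sketch below assumes the open-source `beir` package (`pip install beir`) and uses SciFact as an illustrative dataset; any dataset name from the table above can be substituted.

```python
# Minimal sketch: download and load a BEIR dataset with the beir package.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"  # illustrative; substitute any BEIR dataset name
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets/")

# corpus: doc_id -> {"title", "text"}; queries: query_id -> text;
# qrels: query_id -> {doc_id: relevance}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
```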
MS MARCO
The most widely used passage retrieval benchmark.
| Split | Queries | Relevant Passages |
|---|---|---|
| Train | 502,939 | ~1 per query |
| Dev | 6,980 | ~1 per query |
| Eval | 6,837 | Hidden |
Characteristics:
- Real Bing queries
- Sparse relevance labels (typically one relevant passage per query)
- Large corpus (8.8M passages)
When to use MS MARCO: Training and evaluating passage retrieval models, especially for web-style queries.
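The dev split is also easy to load programmatically. The sketch below assumes the third-party `ir_datasets` package and its `msmarco-passage/dev/small` identifier.

```python
# Sketch: load MS MARCO passage dev queries and relevance labels with ir_datasets.
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/dev/small")

queries = {q.query_id: q.text for q in dataset.queries_iter()}  # ~6,980 real Bing queries
qrels: dict[str, set[str]] = {}
for qrel in dataset.qrels_iter():  # sparse labels, typically one relevant passage per query
    qrels.setdefault(qrel.query_id, set()).add(qrel.doc_id)
# dataset.docs_iter() streams the full 8.8M-passage corpus
```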
Domain-Specific Datasets
| Domain | Dataset | Description |
|---|---|---|
| Legal | CaseHOLD | Legal case holdings |
| Medical | PubMedQA | Biomedical question answering |
| Code | CodeSearchNet | Code retrieval |
| Finance | FiQA | Financial opinion QA |
| Scientific | SCIDOCS | Scientific document retrieval |
For Product Managers
Choosing the right benchmark: Select datasets that match your domain. A legal RAG system should prioritize CaseHOLD over MS MARCO. Generic benchmarks show general capability; domain benchmarks show production relevance.
Benchmark Methodology
Experimental Setup
from dataclasses import dataclass
from datetime import datetime
from typing import Callable
import time
import numpy as np
@dataclass
class BenchmarkConfig:
"""Configuration for a benchmark run."""
name: str
dataset_path: str
embedding_model: str
retrieval_k: list[int] # e.g., [1, 5, 10, 20]
num_runs: int = 3 # For statistical significance
warmup_queries: int = 100
@dataclass
class BenchmarkResult:
"""Results from a benchmark run."""
config: BenchmarkConfig
metrics: dict[str, float] # metric_name -> value
latencies: list[float] # Per-query latencies
timestamp: str
def run_benchmark(
config: BenchmarkConfig,
retriever: Callable,
dataset: dict,
) -> BenchmarkResult:
"""Run a complete benchmark evaluation."""
# Warmup
for query in dataset["queries"][:config.warmup_queries]:
_ = retriever(query["text"], k=max(config.retrieval_k))
# Collect results
all_results = []
latencies = []
for query in dataset["queries"]:
start = time.perf_counter()
retrieved = retriever(query["text"], k=max(config.retrieval_k))
latencies.append(time.perf_counter() - start)
all_results.append({
"query_id": query["id"],
"retrieved": retrieved,
"relevant": query["relevant_docs"],
})
# Calculate metrics
metrics = calculate_metrics(all_results, config.retrieval_k)
metrics["latency_p50"] = np.percentile(latencies, 50)
metrics["latency_p95"] = np.percentile(latencies, 95)
metrics["latency_p99"] = np.percentile(latencies, 99)
return BenchmarkResult(
config=config,
metrics=metrics,
latencies=latencies,
timestamp=datetime.now().isoformat(),
)
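Wiring the harness together looks roughly like the following. `my_retriever` and `eval_dataset` are placeholders for your retrieval callable (signature `retriever(query_text, k) -> list of document IDs`) and a dataset dict with a `"queries"` list; they are not part of the harness itself.

```python
# Hypothetical usage of the harness above; my_retriever and eval_dataset are placeholders.
config = BenchmarkConfig(
    name="hybrid-v1",
    dataset_path="data/eval_set.json",
    embedding_model="all-MiniLM-L6-v2",
    retrieval_k=[1, 5, 10, 20],
)
result = run_benchmark(config, retriever=my_retriever, dataset=eval_dataset)
print(result.metrics["ndcg@10"], result.metrics["latency_p95"])
```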
Metrics Calculation
def calculate_metrics(
results: list[dict],
k_values: list[int],
) -> dict[str, float]:
"""Calculate retrieval metrics at multiple k values."""
metrics = {}
for k in k_values:
precisions = []
recalls = []
reciprocal_ranks = []
for result in results:
retrieved_k = set(result["retrieved"][:k])
relevant = set(result["relevant"])
# Precision@k
if len(retrieved_k) > 0:
precision = len(retrieved_k & relevant) / len(retrieved_k)
else:
precision = 0.0
precisions.append(precision)
# Recall@k
if len(relevant) > 0:
recall = len(retrieved_k & relevant) / len(relevant)
else:
recall = 1.0 # No relevant docs means perfect recall
recalls.append(recall)
# Reciprocal rank
rr = 0.0
for i, doc_id in enumerate(result["retrieved"][:k], 1):
if doc_id in relevant:
rr = 1.0 / i
break
reciprocal_ranks.append(rr)
metrics[f"precision@{k}"] = np.mean(precisions)
metrics[f"recall@{k}"] = np.mean(recalls)
metrics[f"mrr@{k}"] = np.mean(reciprocal_ranks)
# NDCG calculation
for k in k_values:
ndcg_scores = []
for result in results:
ndcg = calculate_ndcg(
result["retrieved"][:k],
result["relevant"],
k,
)
ndcg_scores.append(ndcg)
metrics[f"ndcg@{k}"] = np.mean(ndcg_scores)
return metrics
def calculate_ndcg(
retrieved: list[str],
relevant: set[str],
k: int,
) -> float:
"""Calculate NDCG@k for a single query."""
# DCG
dcg = 0.0
for i, doc_id in enumerate(retrieved[:k], 1):
rel = 1.0 if doc_id in relevant else 0.0
dcg += rel / np.log2(i + 1)
# Ideal DCG
ideal_rels = [1.0] * min(len(relevant), k)
ideal_rels.extend([0.0] * (k - len(ideal_rels)))
idcg = 0.0
for i, rel in enumerate(ideal_rels, 1):
idcg += rel / np.log2(i + 1)
if idcg == 0:
return 0.0
return dcg / idcg
Statistical Significance
from scipy import stats
def compare_systems(
results_a: list[BenchmarkResult],
results_b: list[BenchmarkResult],
metric: str,
alpha: float = 0.05,
) -> dict:
"""Compare two systems with statistical significance testing."""
scores_a = [r.metrics[metric] for r in results_a]
scores_b = [r.metrics[metric] for r in results_b]
# Paired t-test (if same queries)
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    # Effect size (Cohen's d for paired samples, using the sample SD of the differences)
    diff = np.array(scores_a) - np.array(scores_b)
    cohens_d = np.mean(diff) / np.std(diff, ddof=1)
# Confidence interval for difference
mean_diff = np.mean(diff)
se = stats.sem(diff)
ci = stats.t.interval(
1 - alpha,
len(diff) - 1,
loc=mean_diff,
scale=se,
)
return {
"mean_a": np.mean(scores_a),
"mean_b": np.mean(scores_b),
"difference": mean_diff,
"p_value": p_value,
"significant": p_value < alpha,
"cohens_d": cohens_d,
"confidence_interval": ci,
}
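In use, the output reads directly as a go/no-go signal. The snippet below is a usage sketch; `results_hybrid` and `results_bm25` stand in for lists of `BenchmarkResult` objects from repeated runs over the same query set.

```python
# Hypothetical usage: compare a candidate system against a BM25 baseline.
comparison = compare_systems(results_hybrid, results_bm25, metric="ndcg@10")
if comparison["significant"]:
    print(f"NDCG@10 improved by {comparison['difference']:.3f} "
          f"(p={comparison['p_value']:.3f}, Cohen's d={comparison['cohens_d']:.2f})")
else:
    print("Difference is not statistically significant; add runs or queries.")
```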
Running Your Own Benchmarks
Step 1: Define Your Evaluation Set
For Product Managers
Work with domain experts to create evaluation queries that represent real user needs:
- Sample production queries (if available)
- Interview users about their search patterns
- Identify edge cases that matter for your domain
- Balance query types (simple lookups, complex reasoning, multi-hop)
@dataclass
class EvaluationQuery:
"""A single evaluation query with ground truth."""
id: str
text: str
relevant_docs: list[str] # Document IDs
category: str # For segmented analysis
difficulty: str # easy, medium, hard
source: str # production, synthetic, expert
def create_evaluation_set(
queries: list[dict],
documents: list[dict],
labeling_strategy: str = "expert",
) -> list[EvaluationQuery]:
"""Create an evaluation set with relevance labels."""
evaluation_queries = []
for query in queries:
if labeling_strategy == "expert":
# Manual labeling by domain experts
relevant = get_expert_labels(query, documents)
elif labeling_strategy == "synthetic":
# Generate queries from documents
relevant = [query["source_doc"]]
        elif labeling_strategy == "click":
            # Use click data as a proxy for relevance
            relevant = get_clicked_docs(query["id"])
        else:
            raise ValueError(f"Unknown labeling_strategy: {labeling_strategy}")
evaluation_queries.append(EvaluationQuery(
id=query["id"],
text=query["text"],
relevant_docs=relevant,
category=query.get("category", "unknown"),
difficulty=query.get("difficulty", "medium"),
source=labeling_strategy,
))
return evaluation_queries
Step 2: Establish Baselines
def establish_baselines(
evaluation_set: list[EvaluationQuery],
documents: list[dict],
) -> dict[str, BenchmarkResult]:
"""Run baseline retrievers for comparison."""
baselines = {}
# BM25 baseline
bm25_retriever = create_bm25_retriever(documents)
baselines["bm25"] = run_benchmark(
config=BenchmarkConfig(name="BM25", ...),
retriever=bm25_retriever,
dataset=evaluation_set,
)
# Dense retrieval baseline
dense_retriever = create_dense_retriever(
documents,
model="all-MiniLM-L6-v2",
)
baselines["dense_minilm"] = run_benchmark(
config=BenchmarkConfig(name="Dense-MiniLM", ...),
retriever=dense_retriever,
dataset=evaluation_set,
)
# OpenAI embeddings baseline
openai_retriever = create_dense_retriever(
documents,
model="text-embedding-3-small",
)
baselines["openai_small"] = run_benchmark(
config=BenchmarkConfig(name="OpenAI-Small", ...),
retriever=openai_retriever,
dataset=evaluation_set,
)
return baselines
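The factory functions above (`create_bm25_retriever`, `create_dense_retriever`) are left to you. One possible sketch is shown below, assuming the `rank_bm25` and `sentence-transformers` packages and documents shaped as `{"id": ..., "text": ...}` dicts; the OpenAI baseline would need an API-backed encoder instead of `SentenceTransformer`.

```python
# One possible implementation of the baseline factories (not the only one).
# Assumes rank_bm25 and sentence-transformers; document shape is {"id", "text"}.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def create_bm25_retriever(documents: list[dict]):
    doc_ids = [d["id"] for d in documents]
    tokenized = [d["text"].lower().split() for d in documents]
    bm25 = BM25Okapi(tokenized)

    def retrieve(query: str, k: int) -> list[str]:
        scores = bm25.get_scores(query.lower().split())
        top = np.argsort(scores)[::-1][:k]
        return [doc_ids[i] for i in top]

    return retrieve

def create_dense_retriever(documents: list[dict], model: str):
    doc_ids = [d["id"] for d in documents]
    encoder = SentenceTransformer(model)
    # Normalized embeddings make the dot product equal to cosine similarity.
    doc_emb = encoder.encode([d["text"] for d in documents], normalize_embeddings=True)

    def retrieve(query: str, k: int) -> list[str]:
        q_emb = encoder.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(doc_emb @ q_emb)[::-1][:k]
        return [doc_ids[i] for i in top]

    return retrieve
```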
Step 3: Run Comparative Experiments
def run_experiment(
name: str,
retriever: Callable,
evaluation_set: list[EvaluationQuery],
baselines: dict[str, BenchmarkResult],
num_runs: int = 3,
) -> dict:
"""Run an experiment and compare to baselines."""
# Run multiple times for statistical significance
results = []
for run in range(num_runs):
result = run_benchmark(
config=BenchmarkConfig(name=name, ...),
retriever=retriever,
dataset=evaluation_set,
)
results.append(result)
    # Compare to each baseline. Note: the single baseline result is repeated here
    # as an approximation; ideally each baseline is also run num_runs times.
comparisons = {}
for baseline_name, baseline_result in baselines.items():
comparison = compare_systems(
results,
[baseline_result] * num_runs,
metric="ndcg@10",
)
comparisons[baseline_name] = comparison
return {
"results": results,
"comparisons": comparisons,
"summary": summarize_experiment(results, comparisons),
}
Step 4: Analyze Results by Segment
def analyze_by_segment(
results: list[dict],
evaluation_set: list[EvaluationQuery],
) -> dict[str, dict]:
"""Analyze performance by query category."""
# Group queries by category
categories = {}
for query in evaluation_set:
if query.category not in categories:
categories[query.category] = []
categories[query.category].append(query.id)
# Calculate metrics per category
segment_metrics = {}
for category, query_ids in categories.items():
category_results = [
r for r in results
if r["query_id"] in query_ids
]
        segment_metrics[category] = calculate_metrics(
            category_results,
            k_values=[1, 5, 10],
        )
        # Track segment size for reporting
        segment_metrics[category]["count"] = len(category_results)
return segment_metrics
Benchmark Reporting
Standard Report Format
def generate_benchmark_report(
    experiment_name: str,
    results: list[BenchmarkResult],
    baselines: dict[str, BenchmarkResult],
    segment_analysis: dict,
) -> str:
    """Generate a standardized benchmark report."""
    metrics = results[0].metrics
    # Relative improvement over each baseline
    delta_bm25 = metrics["ndcg@10"] / baselines["bm25"].metrics["ndcg@10"] - 1
    delta_dense = metrics["ndcg@10"] / baselines["dense_minilm"].metrics["ndcg@10"] - 1
    delta_bm25_r = metrics["recall@10"] / baselines["bm25"].metrics["recall@10"] - 1
    delta_dense_r = metrics["recall@10"] / baselines["dense_minilm"].metrics["recall@10"] - 1
    # Per-category rows for the segment table
    segment_table = "\n".join(
        f"| {category} | {m['ndcg@10']:.3f} | {m['recall@10']:.3f} | {m['count']} |"
        for category, m in segment_analysis.items()
    )
    report = f"""
# Benchmark Report: {experiment_name}
## Summary
| Metric | Value | vs BM25 | vs Dense |
|--------|-------|---------|----------|
| NDCG@10 | {metrics['ndcg@10']:.3f} | +{delta_bm25:.1%} | +{delta_dense:.1%} |
| Recall@10 | {metrics['recall@10']:.3f} | +{delta_bm25_r:.1%} | +{delta_dense_r:.1%} |
| P95 Latency | {metrics['latency_p95']*1000:.1f}ms | - | - |
## Statistical Significance
All improvements significant at p < 0.05 (paired t-test).
## Segment Analysis
| Category | NDCG@10 | Recall@10 | Count |
|----------|---------|-----------|-------|
{segment_table}
## Recommendations
{generate_recommendations(results, segment_analysis)}
"""
    return report
For Product Managers
Key metrics to track:
- NDCG@10: Overall ranking quality
- Recall@10: Coverage of relevant documents
- P95 Latency: User experience impact
- Segment performance: Where are we weakest?
Questions to answer:
- Are we better than baselines? By how much?
- Is the improvement statistically significant?
- Which query types benefit most/least?
- What is the latency cost of improvements?
Common Pitfalls
PM Pitfall
Benchmark shopping: Choosing benchmarks that make your system look good rather than benchmarks that reflect your use case. Always include domain-relevant benchmarks alongside standard ones.
Engineering Pitfall
Overfitting to benchmarks: Optimizing specifically for benchmark queries rather than general retrieval quality. Use held-out test sets and production sampling to detect this.
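A lightweight guard is to hold out a slice of the evaluation set that is never touched during tuning and confirm that gains transfer to it. A minimal sketch, with an illustrative split fraction and seed:

```python
# Sketch: split the evaluation set so tuning and final reporting use disjoint queries.
import random

def split_evaluation_set(queries: list, holdout_fraction: float = 0.2, seed: int = 42):
    """Return (tuning_set, holdout_set); the fixed seed keeps the split reproducible."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Tune only on the tuning set; a large quality gap on the held-out set
# (or on freshly sampled production queries) signals benchmark overfitting.
```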
PM Pitfall
Ignoring latency: A 10% NDCG improvement that doubles latency may not be worth it. Always report latency alongside quality metrics.
Engineering Pitfall
Single-run comparisons: Running each system once and comparing results. Always run multiple times and report statistical significance.
Quick Reference
Minimum Viable Benchmark
For quick comparisons, use this minimal setup:
- 100+ evaluation queries with relevance labels
- BM25 baseline (always include)
- One dense baseline (e.g., all-MiniLM-L6-v2)
- 3 runs for statistical significance
- Report NDCG@10, Recall@10, P95 latency
Comprehensive Benchmark
For thorough evaluation:
- 500+ evaluation queries across categories
- Multiple baselines (BM25, 2-3 dense models)
- 5+ runs per configuration
- Segment analysis by query type
- Statistical significance testing
- Latency profiling at multiple percentiles
Navigation
- Previous: Appendix B: Algorithms Reference - Algorithm pseudocode and complexity
- Next: Appendix D: Debugging RAG Systems - Systematic debugging methodology
- Reference: Glossary | Quick Reference
- Book Index: Book Overview