Quick Reference

A condensed reference for the key concepts, metrics, decision frameworks, and checklists from the book. Use this as a quick lookup when building and improving RAG systems.


Chapter Summaries

Chapter 0: Introduction - The Product Mindset

Core Concept: Treat RAG as a product that improves continuously, not a project that ships once.

Key Takeaways:

  • The improvement flywheel: Measure → Analyze → Improve → Deploy → Repeat
  • Distinguish inventory problems (missing data) from capability problems (cannot find existing data)
  • Embeddings capture meaning; vector databases enable fast similarity search
  • Semantic search finds meaning; lexical search finds exact terms; hybrid combines both

Chapter 1: Evaluation-First Development

Core Concept: You cannot improve what you cannot measure—and you can measure before you have users.

Key Takeaways:

  • Leading metrics (experiment velocity) predict success; lagging metrics (satisfaction) measure it
  • Prioritize recall over precision with modern LLMs—they handle irrelevant context well
  • Synthetic data bootstraps evaluation before real users arrive
  • Statistical significance requires proper sample sizes (typically 200-400 examples)

Chapter 2: Training Data and Fine-Tuning

Core Concept: Fine-tune embedding models (cheap, fast) not language models (expensive, complex).

Key Takeaways:

  • Bi-encoders are fast (precomputed); cross-encoders are accurate (computed per pair)
  • Re-rankers give 12-20% improvement with no training; fine-tuning gives 6-10% additional improvement
  • Hard negatives are the most valuable training examples—mine them from retrieval failures
  • 6,000+ examples enable effective fine-tuning; with fewer, lean on a re-ranker instead

Chapter 3: Feedback Systems and UX

Core Concept: Feedback is the fuel for the improvement flywheel—collect it intentionally.

Key Takeaways:

  • Specific feedback prompts ("Did we answer your question?") get 5x more responses
  • Implicit signals (query refinement, abandonment) reveal failures explicit feedback misses
  • Streaming reduces perceived latency; citations build trust
  • Negative feedback requires follow-up to be actionable

Chapter 4: Query Understanding and Prioritization

Core Concept: Not all queries are equal—prioritize improvements by volume, satisfaction gap, and strategic value.

Key Takeaways:

  • Cluster queries by embedding similarity to discover patterns
  • Prioritization score = Volume x (1 - Satisfaction) x Achievable Delta x Strategic Relevance
  • High volume + low satisfaction = fix first; high volume + high satisfaction = maintain; low volume + high satisfaction = expand
  • Topic modeling reveals what users actually ask about vs what you expected

Chapter 5: Specialized Retrieval Systems

Core Concept: One retriever cannot excel at everything—build specialized systems for different content types.

Key Takeaways:

  • RAPTOR creates hierarchical summaries for long documents
  • Metadata extraction enables filtering before semantic search
  • Synthetic text generation describes non-text content (images, tables) for embedding
  • Multimodal retrieval requires unified or specialized embedding models

Chapter 6: Query Routing and Orchestration

Core Concept: Success = P(selecting right retriever) x P(retriever finding data).

Key Takeaways:

  • Few-shot classification: 10 examples = 85%, 40 examples = 95% accuracy
  • Three router architectures: classifier-based (fast), embedding-based (flexible), LLM-based (powerful)
  • Tools-as-APIs pattern enables parallel team development
  • Avoid data leakage: never include test examples in few-shot prompts

Chapter 7: Production Operations

Core Concept: Shipping is the starting line—production success requires cost-aware design and graceful degradation.

Key Takeaways:

  • LLM generation is 60-75% of costs; optimize context size first
  • Write-time computation for stable content; read-time for dynamic content
  • Semantic caching returns similar (not just identical) query results
  • Monitor retrieval metrics, not just generation quality

Chapter 8: Hybrid Search

Core Concept: Semantic search fails on exact terms and rare vocabulary—hybrid search combines the best of both.

Key Takeaways:

  • BM25 excels at exact matches, rare terms, and specific identifiers
  • Reciprocal Rank Fusion (RRF) combines results without score normalization
  • Typical hybrid improvement: 10-25% over semantic-only
  • Start with equal weights (0.5/0.5), then tune based on evaluation

Chapter 9: Context Window Management

Core Concept: Models pay less attention to information in the middle—position matters.

Key Takeaways:

  • "Lost in the Middle" effect: models attend to beginning and end more than middle
  • Token budgeting: allocate fixed portions to system prompt, context, history, generation
  • Dynamic context assembly: build context at query time based on relevance
  • Summarization reduces tokens while preserving key information

Core Metrics

Retrieval Metrics

| Metric | Formula | What It Tells You |
| --- | --- | --- |
| Precision@K | Relevant in top K / K | Are your results relevant? |
| Recall@K | Relevant in top K / Total relevant | Are you finding everything? |
| F1 Score | 2 × (P × R) / (P + R) | Balance of precision and recall |
| MRR | 1 / Rank of first relevant | How quickly do you find something useful? |
| NDCG@K | DCG@K / IDCG@K | Quality of ranking with graded relevance |
| MAP | Mean of average precision per query | Overall ranking quality |

Rule of thumb: With modern LLMs, prioritize recall over precision. They handle irrelevant context well.
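
As a quick sketch, these metrics reduce to a few lines of Python; the function names and the toy evaluation set below are illustrative, not from the book.

```python
# Illustrative sketch: Precision@K, Recall@K, and MRR over a labeled eval set.
# `retrieved` is the ranked list of doc IDs a retriever returned for a query;
# `relevant` is the set of doc IDs labeled relevant for that query.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0  # nothing relevant was retrieved

# Toy evaluation set: two queries' (retrieved, relevant) pairs.
eval_set = [
    (["d3", "d7", "d1"], {"d1", "d9"}),
    (["d2", "d4"], {"d4"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(f"MRR: {mrr:.2f}")                                         # 0.42
print(f"Recall@3, query 1: {recall_at_k(*eval_set[0], 3):.2f}")  # 0.50
```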

System Performance

| Metric | Formula | Target |
| --- | --- | --- |
| End-to-end success | P(router correct) × P(retrieval correct) | 75%+ |
| Feedback rate | Feedback submissions / Total queries | 0.5%+ (5x better than typical) |
| Experiment velocity | Experiments run per week | 5-10 for early systems |
| Cache hit rate | Cached responses / Total queries | 20-40% for semantic cache |

Typical Performance Benchmarks

| Metric | Typical | Good | Excellent |
| --- | --- | --- | --- |
| Feedback rate | 0.1% | 0.5% | 2%+ |
| Recall@10 | 50% | 75% | 90%+ |
| Router accuracy | 70% | 90% | 95%+ |
| Re-ranker improvement | 5% | 12% | 20%+ |
| Fine-tuning improvement | 3% | 6% | 10%+ |
| Hard negative boost | 6% | 15% | 30%+ |

Decision Frameworks

Is It an Inventory Problem or Capability Problem?

Can a human expert find the answer by manually searching?
    |
    +-- NO --> Inventory Problem
    |          Fix: Add missing content
    |
    +-- YES --> Capability Problem
               Fix: Improve retrieval/routing

Should You Fine-tune or Use a Re-ranker?

Do you have 6,000+ labeled examples?
    |
    +-- NO --> Use re-ranker (12-20% improvement, no training needed)
    |
    +-- YES --> Do you have hard negatives?
                    |
                    +-- NO --> Mine hard negatives first, then fine-tune
                    |
                    +-- YES --> Fine-tune embeddings (6-10% improvement)

Bi-encoder vs Cross-encoder Selection

Is latency critical (<100ms)?
    |
    +-- YES --> Bi-encoder only
    |
    +-- NO --> Is precision critical (legal, medical)?
                    |
                    +-- YES --> Bi-encoder + Cross-encoder re-ranking
                    |
                    +-- NO --> Bi-encoder with optional re-ranking

Write-time vs Read-time Computation

| Factor | Write-time (Preprocess) | Read-time (On-demand) |
| --- | --- | --- |
| Content changes | Rarely | Frequently |
| Latency requirements | Strict (<100ms) | Flexible (1-2s OK) |
| Storage budget | Available | Constrained |
| Query patterns | Predictable | Unpredictable |

Hybrid Search Decision

Does your domain have specialized vocabulary or identifiers?
    |
    +-- YES --> Use hybrid search (semantic + BM25)
    |
    +-- NO --> Do users search for exact phrases or codes?
                    |
                    +-- YES --> Use hybrid search
                    |
                    +-- NO --> Semantic search may be sufficient

Vector Database Selection

Do you have existing PostgreSQL expertise?
    |
    +-- YES --> Is your dataset < 1M vectors?
    |               |
    |               +-- YES --> pgvector
    |               +-- NO --> pgvectorscale or migrate
    |
    +-- NO --> Do you want managed infrastructure?
                    |
                    +-- YES --> Pinecone
                    |
                    +-- NO --> Want hybrid search experiments?
                                    |
                                    +-- YES --> LanceDB
                                    +-- NO --> ChromaDB (prototypes) or Turbopuffer (performance)

Key Formulas

Retrieval Metrics

| Metric | Formula |
| --- | --- |
| Precision@K | relevant_in_top_k / k |
| Recall@K | relevant_in_top_k / total_relevant |
| F1 | 2 × (precision × recall) / (precision + recall) |
| MRR | mean(1 / rank_of_first_relevant) |
| Cosine Similarity | (A · B) / (‖A‖ × ‖B‖) |
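
A small sketch of cosine similarity written out from the formula above; the vectors are toy values.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # (A · B) / (‖A‖ × ‖B‖)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"{cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]):.2f}")  # 0.50
```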

System Performance

| Metric | Formula |
| --- | --- |
| End-to-end success | P(router_correct) × P(retrieval_correct) |
| Prioritization score | Volume × (1 - Satisfaction) × Delta × Relevance |
| RRF score | Σ 1/(k + rank_i(d)), where k = 60 typically |
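
RRF is short enough to sketch directly; the two ranked lists here are illustrative.

```python
from collections import defaultdict

# Sketch of Reciprocal Rank Fusion: combine ranked lists without normalizing scores.
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:                     # e.g. [semantic_hits, bm25_hits]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]
print(rrf_fuse([semantic_hits, bm25_hits]))          # doc_a and doc_c rank highest
```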

Statistical Testing

| Calculation | Formula |
| --- | --- |
| Sample size (proportions) | n = (z² × p × (1-p)) / e² |
| Confidence interval | p ± z × sqrt(p(1-p)/n) |
| Chi-square statistic | Σ (observed - expected)² / expected |
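
These formulas translate directly to code; the sketch below assumes the standard normal approximation for proportions, with z = 1.96 for 95% confidence.

```python
import math

def sample_size(p: float = 0.5, e: float = 0.05, z: float = 1.96) -> int:
    # n = z^2 * p * (1 - p) / e^2; p = 0.5 is the worst-case proportion
    return math.ceil(z**2 * p * (1 - p) / e**2)

def confidence_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    # p ± z * sqrt(p(1 - p) / n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

print(sample_size(e=0.05))                # 385 examples for a ±5% margin
print(confidence_interval(0.75, n=300))   # e.g. 75% Recall@10 measured on 300 queries
```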

Cost Estimation

Monthly cost = 
    (Documents × Tokens/doc × Embedding cost)           # One-time
  + (Queries/day × 30 × Input tokens × Input cost)      # Recurring
  + (Queries/day × 30 × Output tokens × Output cost)    # Recurring
  + Infrastructure                                       # Fixed
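
The same estimate as a short function; the prices and volumes in the example are placeholders, not real provider rates.

```python
# Every price and volume below is illustrative -- substitute your provider's
# actual rates and your own traffic before trusting the output.
def monthly_cost(documents, tokens_per_doc, embed_cost_per_token,
                 queries_per_day, input_tokens, output_tokens,
                 input_cost_per_token, output_cost_per_token,
                 infrastructure):
    embedding = documents * tokens_per_doc * embed_cost_per_token            # one-time
    gen_input = queries_per_day * 30 * input_tokens * input_cost_per_token   # recurring
    gen_output = queries_per_day * 30 * output_tokens * output_cost_per_token
    return embedding + gen_input + gen_output + infrastructure               # + fixed

print(monthly_cost(
    documents=100_000, tokens_per_doc=500, embed_cost_per_token=0.02e-6,
    queries_per_day=2_000, input_tokens=4_000, output_tokens=500,
    input_cost_per_token=3e-6, output_cost_per_token=15e-6,
    infrastructure=500.0,
))  # 1671.0 -- generation is ~70% of it, in line with the breakdown below
```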

Cost Optimization

Typical Cost Breakdown

| Component | Percentage | Optimization Potential |
| --- | --- | --- |
| Embedding generation | 5-10% | Medium (batch, cache) |
| Retrieval infrastructure | 10-20% | High (right-size, cache) |
| LLM generation | 60-75% | High (context size, caching) |
| Logging/monitoring | 5-10% | Low (sample, aggregate) |

Cost Reduction Techniques

| Technique | Typical Savings | Complexity |
| --- | --- | --- |
| Prompt caching | 70-90% on repeat queries | Low |
| Semantic caching | 20-30% | Medium |
| Self-hosted embeddings | 50-80% on embedding costs | High |
| Smaller context windows | 30-50% on generation | Low |
| Batch processing | 20-40% on embeddings | Low |
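
Semantic caching is simple enough to sketch: reuse an earlier answer when a new query's embedding is close enough to one already served. Everything below (the `embed` callable, the 0.95 threshold, the linear scan) is an assumption to adapt, not a specific library's API.

```python
import math

# Minimal semantic-cache sketch. A real cache would query your vector index
# instead of scanning a list, and the similarity threshold should be tuned.
class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []   # (query embedding, answer)

    def get(self, query: str):
        q = self.embed(query)
        for vec, answer in self.entries:
            if _cosine(q, vec) >= self.threshold:
                return answer                # hit: skip retrieval and generation
        return None                          # miss: run the full pipeline, then put()

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
```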

Prioritization Matrix

The 2x2 for Query Segments

                    High Volume
                         |
         +---------------+---------------+
         |   DANGER      |   STRENGTH    |
         |   Fix first   |   Maintain    |
         |               |               |
Low -----+---------------+---------------+----- High
Satisfaction             |               Satisfaction
         |               |               |
         |   MONITOR     |   OPPORTUNITY |
         |   Low priority|   Expand      |
         |               |               |
         +---------------+---------------+
                         |
                    Low Volume

Prioritization Score

Score = Volume% × (1 - Satisfaction%) × Achievable Delta × Strategic Relevance

Example: Scheduling queries are 8% of volume, 25% satisfaction, 50% achievable improvement, high strategic relevance = High priority fix
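
The same score as code; treating strategic relevance as a simple 1.0 weight is an assumption for the example, so use whatever scale your team agrees on.

```python
def priority_score(volume, satisfaction, achievable_delta, strategic_relevance):
    # Volume × (1 - Satisfaction) × Achievable Delta × Strategic Relevance
    return volume * (1 - satisfaction) * achievable_delta * strategic_relevance

# The scheduling example above: 8% of volume, 25% satisfied, 50% achievable
# improvement, strategic relevance weighted 1.0 (an assumed scale).
print(f"{priority_score(0.08, 0.25, 0.50, 1.0):.3f}")   # 0.030
```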


Routing Performance

Few-shot Examples Impact

| Examples | Typical Accuracy |
| --- | --- |
| 5 | 75-80% |
| 10 | 85-88% |
| 20 | 90-92% |
| 40 | 94-96% |

End-to-end Impact

| Router Accuracy | Retrieval Accuracy | Overall Success |
| --- | --- | --- |
| 67% | 80% | 54% |
| 85% | 80% | 68% |
| 95% | 82% | 78% |
| 98% | 85% | 83% |
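
The overall-success column is just the product of the first two; a quick check of the rows:

```python
# End-to-end success = P(router correct) × P(retrieval correct)
for router, retrieval in [(0.67, 0.80), (0.85, 0.80), (0.95, 0.82), (0.98, 0.85)]:
    print(f"{router:.0%} router x {retrieval:.0%} retrieval = {router * retrieval:.0%}")
```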

Chunking Defaults

| Content Type | Chunk Size | Overlap | Notes |
| --- | --- | --- | --- |
| General text | 800 tokens | 50% | Good starting point |
| Legal/regulatory | 1500-2000 tokens | 30% | Preserve full clauses |
| Technical docs | 400-600 tokens | 40% | Precise retrieval |
| Conversations | Page-level | Minimal | Maintain context |

Warning: Chunk optimization rarely gives >10% improvement. Focus on query understanding and metadata filtering first.
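
For reference, a minimal sliding-window chunking sketch using the general-text defaults above; whitespace splitting stands in for whatever tokenizer you actually use.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 800, overlap: float = 0.5) -> list[list[str]]:
    step = max(1, int(chunk_size * (1 - overlap)))      # 50% overlap -> advance 400 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                                       # last window already reached the end
    return chunks

words = ("lorem ipsum " * 1000).split()                 # 2,000 fake "tokens"
print(len(chunk_tokens(words)), "chunks")               # 4 overlapping chunks
```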


Feedback Copy That Works

Do Use

  • "Did we answer your question?" (5x better than generic)
  • "Did this run do what you expected?"
  • "Was this information helpful for your task?"

Do Not Use

  • "How did we do?" (too vague)
  • "Rate your experience" (users think you mean UI)
  • "Was this helpful?" (without context)

After Negative Feedback

Ask specific follow-up:

  • "Was the information wrong?"
  • "Was something missing?"
  • "Was it hard to understand?"

Production Checklists

Before Launch

  • Baseline metrics established (Recall@5, Precision@5)
  • 50+ evaluation examples covering main query types
  • Feedback mechanism visible and specific
  • Error handling and fallbacks implemented
  • Cost monitoring in place
  • Graceful degradation tested

Weekly Review

  • Check retrieval metrics for degradation
  • Review negative feedback submissions
  • Analyze new query patterns
  • Run at least 2 experiments
  • Update evaluation set with edge cases
  • Review cost trends

Monthly Review

  • Cost trend analysis
  • Query segment performance comparison
  • Model/embedding update evaluation
  • Roadmap prioritization refresh
  • Review routing accuracy
  • Update training data with new examples

Common Pitfalls by Role

PM Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Vague metrics | "Make it better" | Define specific, measurable targets |
| Premature optimization | Tweaking before measuring | Establish baselines first |
| Ignoring retrieval | Focus only on generation | Measure retrieval separately |
| Underinvesting in feedback | Low response rates | Specific prompts, strategic placement |

Engineering Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Data leakage | Inflated test metrics | Separate train/test splits |
| Absence blindness | Missing retrieval failures | Log and review retrieval results |
| Over-engineering | Complex systems, slow iteration | Start simple, add complexity as needed |
| Ignoring hard negatives | Slow improvement | Mine failures for training data |

Quick Lookup: Key Numbers

| What | Value | Source |
| --- | --- | --- |
| Minimum evaluation examples | 50 | Chapter 1 |
| Statistical significance sample | 200-400 | Chapter 1 |
| Fine-tuning minimum examples | 6,000 | Chapter 2 |
| Few-shot examples for 90% routing | 20 | Chapter 6 |
| Typical re-ranker improvement | 12-20% | Chapter 2 |
| Typical fine-tuning improvement | 6-10% | Chapter 2 |
| Typical hybrid search improvement | 10-25% | Chapter 8 |
| Target feedback rate | 0.5%+ | Chapter 3 |
| LLM cost percentage | 60-75% | Chapter 7 |
| Semantic cache hit rate target | 20-40% | Chapter 7 |