Quick Reference
A condensed reference for the key concepts, metrics, and decision frameworks from the workshops.
Core Metrics
Retrieval Metrics
| Metric | Formula | What It Tells You |
|---|---|---|
| Precision@K | Relevant in top K ÷ K | Are your results relevant? |
| Recall@K | Relevant in top K ÷ Total relevant | Are you finding everything? |
| MRR | 1 ÷ Rank of first relevant | How quickly do you find something useful? |
Rule of thumb: With modern LLMs, prioritize recall over precision. They handle irrelevant context well.
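A minimal Python sketch of these three formulas; the document IDs and relevance labels below are made up for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0

# Made-up example: 2 of the top 5 results are relevant, first hit at rank 2.
retrieved = ["d7", "d3", "d9", "d1", "d4"]
relevant = {"d3", "d1", "d8"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 0.666... (2 of 3 relevant found)
print(mrr(retrieved, relevant))                # 0.5
```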
System Metrics

| Metric | Formula | Target |
|---|---|---|
| End-to-end success | P(router correct) × P(retrieval correct) | 75%+ |
| Feedback rate | Feedback submissions ÷ Total queries | 0.5%+ (5x better than typical) |
| Experiment velocity | Experiments run per week | 5-10 for early systems |
Decision Frameworks
Is It an Inventory Problem or a Capabilities Problem?
Can a human expert find the answer by manually searching?
│
├── NO → Inventory Problem
│ Fix: Add missing content
│
└── YES → Capabilities Problem
Fix: Improve retrieval/routing
Should You Fine-tune or Use a Re-ranker?
Do you have 5,000+ labeled examples?
│
├── NO → Use re-ranker (12-20% improvement, no training needed)
│
└── YES → Do you have hard negatives?
│
├── NO → Mine hard negatives first, then fine-tune
│
└── YES → Fine-tune embeddings (6-10% improvement)
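If you want the branch logic above as a checklist in code, a minimal sketch could look like the following; the function name is illustrative and the thresholds simply mirror the tree.

```python
def retrieval_improvement_plan(num_labeled_examples: int, has_hard_negatives: bool) -> str:
    """Encode the decision tree above; thresholds follow the workshop rule of thumb."""
    if num_labeled_examples < 5_000:
        return "Use a re-ranker (~12-20% improvement, no training needed)"
    if not has_hard_negatives:
        return "Mine hard negatives first, then fine-tune embeddings"
    return "Fine-tune embeddings (~6-10% improvement)"

print(retrieval_improvement_plan(1_200, has_hard_negatives=False))  # re-ranker
print(retrieval_improvement_plan(8_000, has_hard_negatives=True))   # fine-tune
```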
Write-time vs Read-time Computation
| Factor | Write-time (Preprocess) | Read-time (On-demand) |
|---|---|---|
| Content changes | Rarely | Frequently |
| Latency requirements | Strict (<100ms) | Flexible (1-2s OK) |
| Storage budget | Available | Constrained |
| Query patterns | Predictable | Unpredictable |
Cost Estimation
Monthly cost =
(Documents × Tokens/doc × Embedding cost) # One-time
+ (Queries/day × 30 × Input tokens × Input cost) # Recurring
+ (Queries/day × 30 × Output tokens × Output cost) # Recurring
+ Infrastructure # Fixed
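A worked example of the formula as a quick script; every price and volume below is an illustrative assumption, not a quoted rate.

```python
# All figures are illustrative assumptions; substitute your own volumes and prices.
documents = 50_000
tokens_per_doc = 800
embedding_cost_per_token = 0.02 / 1_000_000   # $ per token (assumed)

queries_per_day = 2_000
input_tokens_per_query = 4_000
output_tokens_per_query = 500
input_cost_per_token = 3.00 / 1_000_000       # $ per token (assumed)
output_cost_per_token = 15.00 / 1_000_000     # $ per token (assumed)

infrastructure = 300.0                         # fixed monthly $ (assumed)

embedding_once = documents * tokens_per_doc * embedding_cost_per_token       # one-time
generation_in = queries_per_day * 30 * input_tokens_per_query * input_cost_per_token
generation_out = queries_per_day * 30 * output_tokens_per_query * output_cost_per_token

monthly_cost = embedding_once + generation_in + generation_out + infrastructure
print(f"${monthly_cost:,.0f} per month")  # ≈ $1,471 with these assumptions
```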
Typical Cost Breakdown
- Embedding generation: 5-10%
- Retrieval infrastructure: 10-20%
- LLM generation: 60-75%
- Logging/monitoring: 5-10%
Cost Reduction Levers
| Technique | Typical Savings | Complexity |
|---|---|---|
| Prompt caching | 70-90% on repeat queries | Low |
| Semantic caching | 20-30% | Medium |
| Self-hosted embeddings | 50-80% on embedding costs | High |
| Smaller context windows | 30-50% on generation | Low |
Prioritization Matrix
The 2x2 for Query Segments
                      High Volume
                           │
           ┌───────────────┼───────────────┐
           │    DANGER     │   STRENGTH    │
           │   Fix first   │   Maintain    │
           │               │               │
  Low ─────┼───────────────┼───────────────┼───── High
  Satisfaction             │             Satisfaction
           │               │               │
           │    MONITOR    │  OPPORTUNITY  │
           │  Low priority │    Expand     │
           │               │               │
           └───────────────┼───────────────┘
                           │
                      Low Volume
Prioritization Score
Score = Volume% × (1 - Satisfaction%) × Achievable Delta × Strategic Relevance
Example: Scheduling queries are 8% of volume, 25% satisfaction, 50% achievable improvement, high strategic relevance → High priority fix
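The same calculation as a small helper, using the scheduling example above; encoding "high" strategic relevance as 0.9 is an assumption.

```python
def priority_score(volume_pct, satisfaction_pct, achievable_delta, strategic_relevance):
    """Score = Volume% × (1 - Satisfaction%) × Achievable Delta × Strategic Relevance."""
    return volume_pct * (1 - satisfaction_pct) * achievable_delta * strategic_relevance

# Scheduling example: 8% of volume, 25% satisfaction, 50% achievable improvement,
# strategic relevance encoded as 0.9 (an assumed value for "high").
print(round(priority_score(0.08, 0.25, 0.50, 0.9), 3))  # 0.027
```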
Feedback Copy That Works
Do Use
- "Did we answer your question?" (5x better than generic)
- "Did this run do what you expected?"
- "Was this information helpful for your task?"
Do Not Use
- "How did we do?" (too vague)
- "Rate your experience" (users think you mean UI)
- "Was this helpful?" (without context)
After Negative Feedback
Ask specific follow-up:
- "Was the information wrong?"
- "Was something missing?"
- "Was it hard to understand?"
Chunking Defaults
| Content Type | Chunk Size | Overlap | Notes |
|---|---|---|---|
| General text | 800 tokens | 50% | Good starting point |
| Legal/regulatory | 1500-2000 tokens | 30% | Preserve full clauses |
| Technical docs | 400-600 tokens | 40% | Precise retrieval |
| Conversations | Page-level | Minimal | Maintain context |
Warning: Chunk optimization rarely gives >10% improvement. Focus on query understanding and metadata filtering first.
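A minimal sketch of fixed-size chunking with overlap, using the general-text defaults from the table; the whitespace split stands in for your embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=800, overlap_ratio=0.5):
    """Split a token list into fixed-size chunks with fractional overlap."""
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks

# Whitespace split is a stand-in for a real tokenizer.
tokens = ("token " * 2000).split()
pieces = chunk_tokens(tokens, chunk_size=800, overlap_ratio=0.5)
print(len(pieces), [len(p) for p in pieces])  # 4 [800, 800, 800, 800]
```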
Vector Database Selection
Do you have existing PostgreSQL expertise?
│
├── YES → Is your dataset < 1M vectors?
│ │
│ ├── YES → pgvector
│ └── NO → pgvector_scale or migrate
│
└── NO → Do you want managed infrastructure?
│
├── YES → Pinecone
│
└── NO → Want hybrid search experiments?
│
├── YES → LanceDB
└── NO → ChromaDB (prototypes) or Turbopuffer (performance)
Few-shot Examples Impact
| Examples | Typical Accuracy |
|---|---|
| 5 | 75-80% |
| 10 | 85-88% |
| 20 | 90-92% |
| 40 | 94-96% |
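To apply this table, you mostly just pack more labeled examples into the routing prompt. A minimal sketch of assembling such a prompt; the wording, routes, and example data are assumptions.

```python
def build_fewshot_prompt(examples, query, n_examples=10):
    """Format up to n labeled (query, route) pairs ahead of the new query."""
    lines = ["Classify each query into a route."]
    for q, route in examples[:n_examples]:
        lines.append(f"Query: {q}\nRoute: {route}")
    lines.append(f"Query: {query}\nRoute:")
    return "\n\n".join(lines)

# Hypothetical labeled examples.
examples = [
    ("When is my next meeting?", "scheduling"),
    ("Summarize last week's incident report", "documents"),
]
print(build_fewshot_prompt(examples, "Book a room for Friday", n_examples=10))
```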
End-to-end Impact
| Router Accuracy | Retrieval Accuracy | Overall Success |
|---|---|---|
| 67% | 80% | 54% |
| 85% | 80% | 68% |
| 95% | 82% | 78% |
| 98% | 85% | 83% |
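Overall success is just the product of the two stage accuracies; a quick check that reproduces the rows above.

```python
# End-to-end success = router accuracy × retrieval accuracy (values from the table above).
for router, retrieval in [(0.67, 0.80), (0.85, 0.80), (0.95, 0.82), (0.98, 0.85)]:
    print(f"{router:.0%} x {retrieval:.0%} -> {router * retrieval:.0%}")
# 67% x 80% -> 54%, 85% x 80% -> 68%, 95% x 82% -> 78%, 98% x 85% -> 83%
```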
Production Checklist
Before Launch
Weekly Review
Monthly Review
Key Numbers to Remember
| Metric | Typical | Good | Excellent |
|---|---|---|---|
| Feedback rate | 0.1% | 0.5% | 2%+ |
| Recall@10 | 50% | 75% | 90%+ |
| Router accuracy | 70% | 90% | 95%+ |
| Re-ranker improvement | 5% | 12% | 20%+ |
| Fine-tuning improvement | 3% | 6% | 10%+ |
| Hard negative boost | 6% | 15% | 30%+ |
Return to Workshop Index | See Glossary for term definitions