The RAG Flywheel
A Systematic Approach to Building Self-Improving AI Products
Most RAG implementations struggle in production because teams focus on model selection and prompt engineering while overlooking the fundamentals: measurement, feedback, and systematic improvement.
This guide presents practical frameworks for building RAG systems that become more valuable over time through continuous learning and data-driven optimization.
Book vs Workshops
This documentation is available in two formats:
- Book: A comprehensive technical reference that synthesizes content from workshops, talks, and office hours into a structured format. The book is organized into four parts with separate guidance for Product Managers and Engineers.
- Workshops: The original course materials, taken directly from the lectures. These follow the workshop progression and include practical exercises you can apply to your own RAG system.
The Problem: Why Most RAG Systems Fail
The failure pattern repeats across organizations:
- Week 1-2: Demo performs well on prepared examples
- Week 3-4: Users report irrelevant results for real queries
- Week 5-6: Team debates model alternatives without measurement
- Week 7-8: Prompt engineering efforts yield inconsistent improvements
- Week 9+: Usage drops as users lose confidence
The issue isn't technology—it's process. Without systematic measurement and improvement mechanisms, RAG systems degrade as user expectations evolve and edge cases accumulate. The legal tech system from the introduction avoided this trap by implementing evaluation from day one, identifying three distinct failure modes, and building specialized solutions for each pattern.
The Solution: The RAG Improvement Flywheel
Introduction: The Product Mindset Shift
Treating RAG as an evolving product rather than a static implementation fundamentally changes how you approach development, measurement, and improvement.
Key concepts: The improvement flywheel • Common failure patterns • Product thinking vs implementation thinking
Chapter 1: Starting the Data Flywheel
Overcome the cold-start problem using synthetic data techniques. Establish evaluation frameworks and begin measuring improvement within days. The consulting firm case study shows how 200 synthetic queries established baselines that led to 40-point recall improvements.
Topics: Synthetic evaluation datasets • Precision/recall frameworks • Leading vs lagging metrics • Experiment velocity tracking • Production monitoring with the Trellis framework
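A minimal sketch of the kind of evaluation loop this chapter builds, assuming you store synthetic (question, source chunk) pairs and have some `retrieve` function; the names and dataset shape here are placeholders, not the book's reference implementation:

```python
# Sketch: score a retriever against a synthetic evaluation set.
# `retrieve` and the EvalCase fields are assumptions; swap in your own
# retrieval call and however you store generated question/chunk pairs.
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str         # synthetic question generated from a chunk
    source_chunk_id: str  # the chunk the question was generated from


def recall_at_k(cases: list[EvalCase], retrieve, k: int = 10) -> float:
    """Fraction of cases whose source chunk appears in the top-k results."""
    hits = 0
    for case in cases:
        retrieved_ids = [chunk.id for chunk in retrieve(case.question, k=k)]
        if case.source_chunk_id in retrieved_ids:
            hits += 1
    return hits / len(cases)
```

With a few hundred synthetic cases, this single number gives you a baseline you can re-run after every retrieval change.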
Chapter 2: From Evaluation to Enhancement
Transform evaluation insights into systematic improvements. Just 6,000 examples can yield 6-10% performance gains through embedding fine-tuning. Re-rankers provide 12-20% improvements with proper implementation. Hard negatives are the key lever: they drive 30% gains versus 6% from baseline fine-tuning.
Topics: Embedding fine-tuning with contrastive learning • Re-ranker integration (12% improvement at top-5) • Hard negative mining strategies • Fine-tuning cost realities ($100s, not $1000s)
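As a rough illustration of re-ranker integration, here is a hedged sketch using a cross-encoder over first-stage candidates; the model name is only an example, and candidates are assumed to be objects with a `.text` attribute from your existing retriever:

```python
# Sketch: re-rank first-stage retrieval candidates with a cross-encoder,
# then keep the best top_k for the generation step.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list, top_k: int = 5) -> list:
    # Score every (query, passage) pair, highest score first.
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```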
Chapter 3: User Experience and Feedback
Design interfaces that collect high-quality feedback. Changing "How did we do?" to "Did we answer your question?" increases feedback 5x (0.1% to 0.5%). Zapier's case study shows how better copy and visibility drove feedback from 10 to 40 submissions daily. Product-as-sensor thinking turns every interaction into training data.
Topics: High-impact feedback copy patterns • Citation systems for trust building • Implicit signal collection (deletion as negative, selection as positive) • Enterprise Slack integration (5x feedback increase)
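A small sketch of the implicit-signal idea, assuming your application emits interaction events and you persist them somewhere; the event names, weights, and `store` interface below are illustrative assumptions:

```python
# Sketch: treat product interactions as implicit feedback signals.
# Deleting a generated answer is a strong negative; copying an answer or
# clicking a citation is a positive.
from datetime import datetime, timezone

SIGNAL_WEIGHTS = {
    "answer_deleted": -1.0,
    "regenerate_requested": -0.5,
    "citation_clicked": 0.5,
    "answer_copied": 1.0,
}


def log_implicit_feedback(store, query_id: str, event: str) -> None:
    weight = SIGNAL_WEIGHTS.get(event)
    if weight is None:
        return  # not a signal we track
    store.append({
        "query_id": query_id,
        "event": event,
        "weight": weight,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```

Logged this way, every session quietly accumulates labeled examples for the fine-tuning work in Chapter 2.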
Chapter 4: Understanding Your Users
Segment queries to identify high-value patterns. Not all queries deserve equal investment. The 2x2 matrix (volume vs satisfaction) reveals the danger zone: high-volume, low-satisfaction segments that quietly kill your product. The construction case study shows how a scheduling segment making up 8% of queries, with only 25% satisfaction, drove 35% of user churn.
Topics: Query clustering with K-means and the Cura process • 2x2 prioritization matrix • Inventory vs capabilities framework • Business value formula (Impact × Volume % × Success Rate) • User adaptation blindness
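A minimal sketch of this chapter's segmentation and prioritization loop, assuming you already have query embeddings and per-segment satisfaction numbers; the scoring helper follows the chapter's stated value formula (Impact x Volume % x Success Rate), and everything else is a placeholder:

```python
# Sketch: cluster queries into segments with K-means, then score segments.
import numpy as np
from sklearn.cluster import KMeans


def cluster_queries(embeddings: np.ndarray, n_clusters: int = 20) -> np.ndarray:
    """Assign each query embedding to one of n_clusters segments."""
    return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)


def segment_value(impact: float, volume_pct: float, success_rate: float) -> float:
    # Impact: estimated business value of a query in this segment.
    # Volume %: the segment's share of total traffic.
    # Success Rate: how often the system currently satisfies it.
    return impact * volume_pct * success_rate
```

Plotting each segment's volume against its satisfaction gives you the 2x2 matrix directly.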
Chapter 5: Building Specialized Capabilities
Build purpose-built retrievers for different content types. A one-size-fits-all retriever is why most RAG systems underperform. Different queries need different retrievers: exact matching for SKUs, semantic search for concepts, structured queries for attributes. Google didn't remain a single search box; it built Maps, Images, and Scholar, each specialized for its content. The blueprint search case study jumped from 27% to 85% recall by using vision models to generate spatial descriptions.
Topics: Two improvement strategies (metadata extraction vs synthetic text) • RAPTOR for long documents (1,500+ pages) • Tool portfolio design • Two-level measurement (P(correct retriever) × P(correct data | retriever))
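The two-level measurement is just a product of probabilities, so it can be expressed in a couple of lines; the numbers below are illustrative placeholders, not figures from the book:

```python
# Sketch: overall retrieval success is bounded by routing accuracy times
# the recall of the retriever that was chosen.
def overall_success(p_correct_router: float, p_correct_data_given_router: float) -> float:
    return p_correct_router * p_correct_data_given_router


# e.g. a 90%-accurate router in front of a retriever with 85% recall
# bounds end-to-end retrieval success at 0.90 * 0.85 = 0.765.
print(overall_success(0.90, 0.85))  # 0.765
```

Measuring both factors separately tells you whether to fix the router or the retriever.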
Chapter 6: Unified Product Architecture
Integrate specialized components through intelligent routing architectures that direct queries to the right tools while maintaining a simple user experience.
Topics: Query routing systems • Tool selection frameworks • Performance monitoring • Continuous improvement pipelines
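A minimal sketch of the routing idea, under the assumption that you expose one `search()` entry point in front of several specialized retrievers; the rule-based classifier and retriever registry are stand-ins for whatever routing approach you actually use (rules, a trained classifier, or LLM tool selection):

```python
# Sketch: route each query to a specialized retriever behind one interface.
from typing import Callable

RETRIEVERS: dict[str, Callable[[str], list]] = {
    "document_search": lambda q: [],  # semantic search over text chunks
    "sku_lookup": lambda q: [],       # exact matching over product codes
    "table_query": lambda q: [],      # structured queries over attributes
}


def classify(query: str) -> str:
    """Placeholder router: replace with a trained classifier or LLM call."""
    if any(token.isdigit() for token in query.split()):
        return "sku_lookup"
    return "document_search"


def search(query: str) -> list:
    retriever = RETRIEVERS.get(classify(query), RETRIEVERS["document_search"])
    return retriever(query)
```

Logging which branch each query takes also feeds the two-level measurement from Chapter 5.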
Conclusion: Product Principles for AI Applications
Core principles that endure beyond specific models or technologies, providing a foundation for AI product development regardless of how the technology evolves.
Industry Perspectives and Case Studies
Practitioners from organizations building production RAG systems share their experiences, failures, and insights.
Selected Talks
- How Zapier Improved Their AI Feedback Collection - Practical changes that increased feedback volume and quality
- Re-rankers and Embedding Fine-tuning - When and how to use re-rankers for retrieval improvement
- When RAG Isn't the Right Solution - Why some coding agents moved away from embedding-based retrieval
- Common RAG Anti-patterns - Mistakes to avoid when building RAG systems
- Limitations of Public Benchmarks - Why MTEB rankings don't always predict production performance
Who This Book Is For
Product Leaders
- Establish metrics that align with business outcomes
- Build frameworks for prioritizing AI product improvements
- Develop product roadmaps based on data rather than intuition
- Communicate AI capabilities and limitations effectively
Engineers
- Implement systems designed for rapid iteration and continuous improvement
- Make architectural decisions that support evolving requirements
- Build modular, specialized capabilities that can be composed and extended
- Manage technical debt in AI systems
Data Scientists
- Create synthetic evaluation datasets for cold-start scenarios
- Segment and analyze user queries to identify patterns
- Measure retrieval effectiveness beyond simple accuracy metrics
- Build feedback loops that enable continuous learning
About the Author
Jason Liu is a machine learning engineer who has worked on computer vision and recommendation systems at Facebook and Stitch Fix. He has helped organizations implement data-driven RAG systems and teaches practical approaches to building AI products that improve over time.