Kickstarting the Data Flywheel with Synthetic Data
Key Insight
You can't improve what you can't measure—and you can measure before you have users. Synthetic data isn't just a stopgap until real users arrive. It's a powerful tool for establishing baselines, testing edge cases, and building the evaluation infrastructure that will power continuous improvement. Start with retrieval metrics (precision and recall), not generation quality, because they're faster, cheaper, and more objective.
Learn the Complete RAG Playbook
All of this content comes from my Systematically Improving RAG Applications course. Readers get 20% off with code EBOOK. Join 500+ engineers who've transformed their RAG systems from demos to production-ready applications.
Alright, let's talk about making RAG applications actually work. Most teams I work with are stuck in this weird loop where they keep tweaking things randomly and hoping something sticks. Sound familiar?
Here's what we're going to cover: how to set up evaluations that actually tell you something useful, common ways teams shoot themselves in the foot (and how to avoid them), and how to use synthetic data to test your system before you even have users.
Common Pitfalls in AI Development
After working with dozens of companies on their RAG applications, I keep seeing the same patterns. Let me walk you through them so you don't waste months making the same mistakes.
The Reasoning Fallacy
I can't tell you how many times I hear "we need more complex reasoning" or "the model isn't smart enough." Nine times out of ten, that's not the problem. The real issue? You don't actually know what your users want.
Think about it - when was the last time you:
- Actually looked at data from customers?
- Read user feedback (not just the positive reviews)?
- Actively asked users what they're struggling with?
If you're like most teams, the answer is "uhh..." And that's the problem. You end up building these generic tools that don't solve any specific problem particularly well.
The Vague Metrics Problem
Here's another one that drives me crazy. Teams will spend weeks changing things and then evaluate success by asking "does it look better?" or "does it feel right?"
I worked with a company valued at $100 million that had maybe 30 evaluation examples. Total. When something broke or improved, they had no idea what actually changed or why.
Without concrete metrics, you get stuck in this loop:
- Make random changes based on gut feeling
- Get unclear results
- Feel frustrated
- Make more random changes
And round and round you go.
flowchart TD
A[Random Changes Based on Intuition] -->|leads to| B[Unclear Results]
B -->|creates| C[Frustration & Uncertainty]
C -->|triggers more| A
D[Concrete Metrics & Evaluation Framework] -->|breaks| A
style D fill:#90EE90,stroke:#006400,stroke-width:2px
Building Generic Solutions
This one's tough because it often comes from good intentions. You want to build something that helps everyone! But here's what actually happens: you build a generic tool that does everything poorly instead of one thing well.
I see this with teams that have 30-40% churn rates but are too scared to narrow their focus because they might miss out on some hypothetical use case.
My advice? Pick a narrow domain, become world-class at it, then expand. You'll learn way more from 100 happy users in one domain than 1,000 frustrated users across ten.
Leading versus Lagging Metrics
This concept changed how I think about improving systems. I learned it at Facebook, and it's been invaluable for RAG applications.
Lagging Metrics
Lagging metrics are the things you care about but can't directly control:
- Application quality
- User satisfaction
- Churn rates
- Revenue
They're like your body weight - easy to measure, hard to change directly.
Leading Metrics
Leading metrics are things you can control that predict future performance:
- Number of experiments run per week
- Evaluation coverage of different question types
- Retrieval precision and recall
- User feedback collection rate
They're like calories consumed or workouts completed - you have direct control.
The Calories In, Calories Out Analogy
Here's a simple analogy that really drives this home. If you want to lose weight (lagging metric), obsessing over the scale won't help much. What works? Tracking calories in and calories out (leading metrics).
It's not perfect, but it's actionable. You can't directly control your weight today, but you can control whether you eat 2,000 or 3,000 calories.
Same thing with RAG applications. You can't directly make users happy, but you can run more experiments, improve retrieval metrics, and collect more feedback.
The #1 Leading Metric: Experiment Velocity
If I had to pick one metric for early-stage RAG applications, it's this: how many experiments are you running?
Instead of asking "did the last change improve things?" in standup, ask "how can we run twice as many experiments next week?" What infrastructure would help? What's blocking us from testing more ideas?
This shift from outcomes to velocity changes everything.
Absence Blindness and Intervention Bias
These two biases kill more RAG projects than anything else.
Absence Blindness
You can't fix what you can't see. Sounds obvious, right? But I see teams obsess over generation quality while completely ignoring whether retrieval works at all.
I had a client spend three weeks fine-tuning prompts. When we finally checked, their retrieval system was returning completely irrelevant documents. No amount of prompt engineering can fix that.
Questions teams forget to ask:
- Is retrieval actually finding the right documents?
- Are our chunks the right size?
- Is our data extraction pipeline working?
- Do we have separate metrics for retrieval vs generation?
Intervention Bias
This is our tendency to do something just to feel like we're making progress. In RAG, it shows up as constantly switching models, tweaking prompts, or adding features without measuring impact.
"Should we use GPT-4 or Claude?" "Will this new prompt technique help?"
My answer is always the same: depends on your data and evaluations. There's no universal answer.
The fix? Every change should target a specific metric and test a clear hypothesis. No more "let's try this and see what happens."
The RAG Flywheel and Retrieval Evaluations
Here's the thing - everything we learned about search applies to retrieval. If you have a basic RAG setup, your next step is testing whether retrieval actually works.
Why Prioritize Retrieval Evaluations
Teams without ML backgrounds often jump straight to generation evaluations. Bad idea. Here's why retrieval evaluations are better:
- Speed: Milliseconds vs seconds
- Cost: Way cheaper to run
- Objectivity: Clear yes/no answers
- Scalability: Run thousands of tests quickly
When you focus on generation too early, everything becomes subjective. Did the model hallucinate? Is this answer good enough? Who knows?
But with retrieval, it's simple: did you find the right document or not?
Understanding Precision and Recall
Let's make these concepts concrete:
Testing Different K Values:
- Start with K=10
- Test K=3, 5, 10, 20 to understand tradeoffs
- Higher K improves recall but may hurt precision
- Advanced models (GPT-4, Claude) handle irrelevant docs better, so lean toward higher K
Why Score Thresholds Are Dangerous:
- Score distributions vary wildly by query type
- A threshold that works for one category fails for others
- Example: typical cosine similarity scores cluster around 0.7 with text-embedding-ada-002 but closer to 0.5 with the newer text-embedding-3 models
- Better approach: Always return top K, let the LLM filter
graph TD
subgraph "Document Universe"
subgraph "All Relevant Documents"
A["Relevant &<br>Retrieved<br>(True Positives)"]
B["Relevant but<br>Not Retrieved<br>(False Negatives)"]
end
subgraph "All Retrieved Documents"
A
C["Retrieved but<br>Not Relevant<br>(False Positives)"]
end
D["Not Relevant &<br>Not Retrieved<br>(True Negatives)"]
end
P["Precision = <br>Relevant & Retrieved<br>All Retrieved"]
R["Recall = <br>Relevant & Retrieved<br>All Relevant"]
A ~~~ P
A ~~~ R
classDef relevant fill:#90EE90,stroke:#006400
classDef retrieved fill:#ADD8E6,stroke:#00008B
classDef both fill:#9370DB,stroke:#4B0082
classDef neither fill:#DCDCDC,stroke:#696969
classDef formula fill:#FFFACD,stroke:#8B8B00,stroke-width:2px
class A both
class B relevant
class C retrieved
class D neither
class P,R formula
Recall: What percentage of relevant documents did you find? If there are 10 correct documents and you found 4, that's 40% recall.
Precision: What percentage of your results were actually relevant? If you returned 10 results but only 2 were relevant, that's 20% precision.
With modern LLMs, prioritize recall. They're pretty good at ignoring irrelevant stuff. With simpler models, precision matters more because they get confused easily.
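To make the arithmetic concrete, here's a tiny sketch that scores one query at several values of K. The document IDs are made up; the only inputs are a ranked list of retrieved IDs and the set of relevant ones:
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for a single query."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 1.0
    return precision, recall

# Toy example: 10 relevant documents exist; our ranked results surface 4 of them
relevant = {f"doc_{i}" for i in range(10)}
retrieved = ["doc_1", "noise_a", "doc_4", "noise_b", "doc_7",
             "noise_c", "noise_d", "doc_9", "noise_e", "noise_f"]

for k in (3, 5, 10, 20):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"K={k:>2}  precision={p:.2f}  recall={r:.2f}")
Run it and you see the tradeoff from the K-value list above: recall climbs as K grows while precision falls.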
Case Studies: Real-World Improvements
Let me share two examples that show how focusing on retrieval metrics leads to quick wins.
Case Study 1: Report Generation from Expert Interviews
A client generates reports from user research interviews. Consultants do 15-30 interviews and want AI-generated summaries.
Problem: Reports were missing quotes. A consultant knew 6 experts said something similar, but the report only cited 3. That 50% recall rate killed trust.
Solution: We built manual evaluation sets from problematic examples. Turns out, better text chunking fixed most issues.
Result: Recall went from 50% to 90% in a few iterations. Customers noticed immediately.
Lesson: Pre-processing that matches how users query can dramatically improve retrieval.
Case Study 2: Blueprint Search for Construction
Another client needed AI search for construction blueprints - workers asking questions about building plans.
Problem: Only 27% recall when finding the right blueprint for questions.
Solution: We used a vision model to create detailed captions for blueprints, including hypothetical questions users might ask.
Result: Four days later, recall hit 85%. Once live, we discovered 20% of queries involved counting objects, which justified investing in bounding box models.
Lesson: Test subsystems independently for rapid improvements. Synthetic data for specific use cases works great.
Chunk Size Best Practices:
Start with 800 tokens and 50% overlap. This works for most use cases.
Why chunk optimization rarely gives big wins:
- Usually yields <10% improvements
- Better to focus on:
  - Query understanding and expansion
  - Metadata filtering
  - Contextual retrieval (adding document context to chunks)
  - Better embedding models
When to adjust:
- Legal/regulatory: Larger chunks (1500-2000 tokens) to preserve full clauses
- Technical docs: Smaller chunks (400-600 tokens) for precise retrieval
- Conversational: Page-level chunks to maintain context
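If you want a starting point in code, here's a minimal token-window chunker built around those defaults. It assumes tiktoken for token counting; swap in whatever tokenizer matches your embedding model:
import tiktoken

def chunk_text(text, chunk_tokens=800, overlap=0.5, encoding_name="cl100k_base"):
    """Split text into fixed-size token windows with fractional overlap."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap)))  # 50% overlap -> advance half a window
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks

# Defaults for most corpora; raise chunk_tokens to ~1800 for legal text, drop to ~500 for technical docs
chunks = chunk_text("Lorem ipsum dolor sit amet. " * 500)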
Practical Implementation: Building Your Evaluation Framework
Time to build something that actually works. Your framework should:
- Run questions through retrieval
- Compare results to ground truth
- Calculate precision and recall
- Track changes over time
This becomes your improvement flywheel. Every change gets evaluated against these metrics.
Building a Practical Evaluation Pipeline
Here's a simple but effective evaluation pipeline:
def evaluate_retrieval(evaluation_data, retriever_fn, k=10):
    """
    Evaluate retrieval performance on a dataset.

    Args:
        evaluation_data: List of dicts with 'question' and 'relevant_docs' keys
        retriever_fn: Function that takes question text and top_k and returns docs
        k: Number of top results to consider

    Returns:
        Dict containing evaluation metrics
    """
    results = []
    for item in evaluation_data:
        question = item['question']
        ground_truth = set(item['relevant_docs'])

        # Call retrieval system
        retrieved_docs = retriever_fn(question, top_k=k)
        retrieved_ids = [doc['id'] for doc in retrieved_docs]

        # Calculate metrics
        retrieved_relevant = set(retrieved_ids) & ground_truth
        precision = len(retrieved_relevant) / len(retrieved_ids) if retrieved_ids else 0.0
        recall = len(retrieved_relevant) / len(ground_truth) if ground_truth else 1.0

        # Store individual result
        results.append({
            'question_id': item.get('id', ''),
            'question': question,
            'precision': precision,
            'recall': recall,
            'retrieved_docs': retrieved_ids,
            'relevant_docs': list(ground_truth),
            'metadata': item.get('metadata', {})
        })

    # Aggregate metrics (guard against an empty evaluation set)
    n = len(results) or 1
    avg_precision = sum(r['precision'] for r in results) / n
    avg_recall = sum(r['recall'] for r in results) / n

    return {
        'avg_precision': avg_precision,
        'avg_recall': avg_recall,
        'detailed_results': results
    }
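And here's roughly what calling it looks like. The stub retriever and document IDs below are placeholders for your actual search call and corpus:
# A stub retriever: swap in your actual vector / hybrid search call.
# It must return dicts with an 'id' key, matching what evaluate_retrieval expects.
def my_retriever(question, top_k=10):
    return [{"id": f"doc_{i}"} for i in range(top_k)]

evaluation_data = [
    {"id": "q1", "question": "What is our refund policy?", "relevant_docs": ["doc_3", "doc_7"]},
    {"id": "q2", "question": "How do I reset my password?", "relevant_docs": ["doc_12"]},
]

metrics = evaluate_retrieval(evaluation_data, my_retriever, k=10)
print(f"precision@10: {metrics['avg_precision']:.2f}, recall@10: {metrics['avg_recall']:.2f}")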
Running Regular Evaluations
Make evaluation part of your routine:
- Continuous testing: Run with every significant change
- Weekly benchmarks: Comprehensive evaluations on schedule
- Version comparison: Always compare new vs old
- Failure analysis: Review 0% recall cases for patterns
- Difficulty progression: Add harder tests as you improve
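To make the "continuous testing" item real, one option is a regression test that fails CI when recall drops below the last accepted baseline. This is a sketch that assumes the evaluate_retrieval helper above plus hypothetical load_eval_set and production_retriever functions from your own codebase:
RECALL_BASELINE = 0.80      # last accepted benchmark; update deliberately, not silently
PRECISION_BASELINE = 0.55

def test_retrieval_does_not_regress():
    eval_set = load_eval_set("eval/retrieval_v3.jsonl")        # hypothetical loader
    metrics = evaluate_retrieval(eval_set, production_retriever, k=10)
    assert metrics["avg_recall"] >= RECALL_BASELINE, (
        f"Recall regressed: {metrics['avg_recall']:.2f} < {RECALL_BASELINE}"
    )
    assert metrics["avg_precision"] >= PRECISION_BASELINE
Run it under pytest in CI; when the benchmark genuinely improves, raise the baseline constants in the same pull request as the change that earned it.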
Production Monitoring Tips:
Track these over time:
- Average cosine distance between queries and retrieved docs
- Percentage of queries with no results above threshold
- Score distributions (watch for bimodal patterns)
Segment by user type:
- New vs returning users have different patterns
- Technical vs non-technical users need different strategies
- Time-based patterns (like during product launches)
Real example: A company's cosine distances spiked during their Super Bowl ad. New users asked different questions, revealing content gaps. They created onboarding content and improved retention 25%.
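A minimal sketch of the monitoring job this implies, computed over a hypothetical list of logged query records (the field names are illustrative, not from any particular logging library). It uses each query's top similarity score as the proxy; flip the comparison if you log cosine distance instead:
from statistics import mean
from collections import defaultdict

def summarize_retrieval_logs(records, score_floor=0.3):
    """records: dicts with 'scores' (top-k similarities) and 'segment' keys."""
    by_segment = defaultdict(list)
    for rec in records:
        by_segment[rec["segment"]].append(rec)

    summary = {}
    for segment, recs in by_segment.items():
        top_scores = [max(r["scores"]) if r["scores"] else 0.0 for r in recs]
        summary[segment] = {
            "avg_top_score": mean(top_scores),
            "pct_below_floor": sum(s < score_floor for s in top_scores) / len(recs),
            "n_queries": len(recs),
        }
    return summary

logs = [
    {"segment": "new_user", "scores": [0.21, 0.18, 0.12]},
    {"segment": "returning", "scores": [0.62, 0.55, 0.51]},
]
print(summarize_retrieval_logs(logs))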
Integrating with Development Workflow
To maximize value:
- Build a dashboard: Simple interface showing trends
- Automate testing: Part of CI/CD
- Set alerts: Notify when metrics drop
- Document experiments: Track changes and impact
- Connect to business: Link retrieval metrics to outcomes
Your framework should evolve with your application. Start simple, add complexity as needed.
Vector Database Selection Guide
"Which vector database should I use?" gets asked in every office hours. Here's my take based on real deployments.
Understanding Your Requirements
First, figure out:
- Scale: How many documents/chunks?
- Query patterns: Need metadata filtering? SQL? Hybrid search?
- Performance: Latency requirements?
- Infrastructure: Existing database expertise?
- Budget: Open source vs managed?
Vector Database Comparison
| Database | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| PostgreSQL + pgvector | Teams with SQL expertise | • Familiar SQL interface • Metadata filtering • ACID compliance • Single database for everything | • Not optimized for vector operations • Limited to ~1M vectors for good performance • No built-in hybrid search | When you need to combine vector search with complex SQL queries and already use PostgreSQL |
| LanceDB | Experimentation & hybrid search | • One-line hybrid search • Built-in re-ranking • S3 storage support • DuckDB for analytics | • Newer, less battle-tested • Smaller community | When you want to quickly test lexical vs vector vs hybrid search approaches |
| Timescale pgvectorscale | Large-scale exhaustive search | • Exhaustive search capability • Better scaling than pgvector • Time-series support | • Requires Timescale • More complex setup | When you need guaranteed recall on large datasets with time-based queries |
| ChromaDB | Simple prototypes | • Easy to get started • Good Python API • Local development friendly | • Performance issues at scale • Limited production features | For POCs and demos under 100k documents |
| Turbopuffer | High-performance search | • Very fast • Good scaling • Simple API | • Newer entrant • Limited ecosystem | When raw performance is critical |
| Pinecone | Managed solution | • Fully managed • Good performance • Reliable | • Expensive at scale • Vendor lock-in | When you want zero infrastructure management |
Decision Framework
flowchart TD
A[Start] --> B{Do you have<br>existing PostgreSQL?}
B -->|Yes| C{Need exhaustive<br>search?}
B -->|No| D{Want to experiment<br>with hybrid search?}
C -->|Yes| E[pgvectorscale]
C -->|No| F{< 1M vectors?}
F -->|Yes| G[pgvector]
F -->|No| H[Consider migration]
D -->|Yes| I[LanceDB]
D -->|No| J{Budget for<br>managed service?}
J -->|Yes| K[Pinecone]
J -->|No| L{Team size?}
L -->|Small| M[ChromaDB]
L -->|Large| N[LanceDB/Turbopuffer]
style E fill:#90EE90
style G fill:#90EE90
style I fill:#90EE90
style K fill:#90EE90
style M fill:#90EE90
style N fill:#90EE90
Implementation Example: LanceDB for Hybrid Search
We use LanceDB in this course because it makes comparing retrieval strategies trivial:
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import CohereReranker

# Embedding function from the registry (OpenAI here; any registered provider works)
embed_fn = get_registry().get("openai").create()

# Define your schema
class Document(LanceModel):
    text: str = embed_fn.SourceField()       # embedded automatically on insert
    source: str
    metadata: str                            # e.g. a JSON string; the schema needs concrete types
    vector: Vector(embed_fn.ndims()) = embed_fn.VectorField()

# Create database and table
db = lancedb.connect("./my_rag_db")
table = db.create_table("documents", schema=Document)

# Add documents (assumes each doc is a dict with matching keys)
table.add([
    {"text": doc["text"], "source": doc["source"], "metadata": doc.get("metadata", "{}")}
    for doc in documents
])

# A full-text index is required before lexical or hybrid search
table.create_fts_index("text")

# Compare retrieval strategies with one-line changes
results_vector = table.search(query).limit(10).to_list()
results_lexical = table.search(query, query_type="fts").limit(10).to_list()
results_hybrid = table.search(query, query_type="hybrid").limit(10).to_list()

# Add re-ranking with one more line: over-fetch candidates, re-rank, keep the top 10
reranker = CohereReranker()
results_reranked = table.search(query).limit(50).rerank(reranker=reranker).to_list()[:10]
This flexibility lets you quickly test what works for your data without rewriting everything.
Migration Considerations
If you're already using a vector database but having issues:
- Measure first: Quantify current performance
- Test in parallel: Run subset of queries against both
- Compare results: Use evaluation metrics to ensure quality
- Plan migration: Consider dual-writing during transition
Don't choose based on hype or benchmarks. Choose based on your team's expertise, query patterns, scaling needs, and budget.
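If you do run that parallel test, the evaluation pipeline from earlier makes the comparison mechanical. A sketch, assuming the evaluate_retrieval helper above; current_retriever and candidate_retriever stand in for thin wrappers around each backend:
def compare_backends(eval_set, current_retriever, candidate_retriever, k=10):
    """Run the same evaluation set against two retrieval backends and report the deltas."""
    current = evaluate_retrieval(eval_set, current_retriever, k=k)
    candidate = evaluate_retrieval(eval_set, candidate_retriever, k=k)
    return {
        "recall_delta": candidate["avg_recall"] - current["avg_recall"],
        "precision_delta": candidate["avg_precision"] - current["avg_precision"],
    }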
From office hours: "We generally want to use SQL databases. If you use something like Timescale or PostgreSQL, there are many ways of doing time filtering... The general idea is to use structured extraction to identify start and end dates, prompt your language model with an understanding of what those dates are, and then use filtering."
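Here's a sketch of that pattern: pull a date range out of the query with structured extraction, then push it down as a filter. The DateRange model and extract_date_range stub are illustrative (in practice the stub would be an LLM call with structured output), and the .where() clause assumes the table has a created_at column; the same idea maps directly to a SQL WHERE clause in PostgreSQL or Timescale:
from datetime import date
from pydantic import BaseModel

class DateRange(BaseModel):
    start: date
    end: date

def extract_date_range(query: str) -> DateRange:
    # In practice: an LLM call with structured output that parses
    # "last quarter", "since March", etc. Stubbed with fixed dates here.
    return DateRange(start=date(2024, 1, 1), end=date(2024, 3, 31))

query = "What did customers complain about last quarter?"
window = extract_date_range(query)

results = (
    table.search(query)
    .where(f"created_at >= '{window.start}' AND created_at <= '{window.end}'")
    .limit(10)
    .to_list()
)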
Creating Synthetic Data for Evaluation
No user data yet? No problem. Synthetic data gets you started. But it's not as simple as asking an LLM for "more data" - you need diverse, realistic datasets that reflect real usage.
flowchart TD
A["Document Corpus"] --> B["Sample Representative<br>Text Chunks"]
B --> C["LLM-Powered<br>Question Generation"]
C --> D1["Factual<br>Questions"]
C --> D2["Inferential<br>Questions"]
C --> D3["Comparative<br>Questions"]
C --> D4["Hypothetical<br>Questions"]
D1 & D2 & D3 & D4 --> E["Verify Retrievability<br>with Current System"]
E --> F1["Easy Questions<br>(80-90% expected recall)"]
E --> F2["Medium Questions<br>(50-70% expected recall)"]
E --> F3["Hard Questions<br>(30-50% expected recall)"]
F1 & F2 & F3 --> G["Comprehensive<br>Evaluation Dataset"]
H["Real User<br>Questions"] -.->|"As they become<br>available"| G
style A fill:#CCCCFF,stroke:#0000AA
style C fill:#FFD700,stroke:#B8860B
style E fill:#90EE90,stroke:#006400
style G fill:#FF7F50,stroke:#A52A2A
style H fill:#98FB98,stroke:#006400,stroke-dasharray: 5 5
Basic Approach: Question Generation
Simple version:
- Take random text chunk
- Ask LLM to generate a question this chunk answers
- Verify retrieval finds the original chunk
This creates a basic dataset testing whether your system finds the right info. Recall becomes binary: found it or didn't.
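A minimal sketch of that loop, using the OpenAI SDK for question generation. The model name is just an example, retrieve(question, k) stands in for your own search call, and chunks is assumed to be a list of dicts with 'id' and 'text' keys:
import random
from openai import OpenAI

client = OpenAI()

def generate_question(chunk_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model works here
        messages=[{
            "role": "user",
            "content": f"Write one question a user might ask that is answered by this passage:\n\n{chunk_text}",
        }],
    )
    return response.choices[0].message.content.strip()

eval_rows = []
for chunk in random.sample(chunks, k=min(50, len(chunks))):
    question = generate_question(chunk["text"])
    retrieved_ids = [d["id"] for d in retrieve(question, k=10)]
    eval_rows.append({
        "question": question,
        "relevant_docs": [chunk["id"]],
        "found": chunk["id"] in retrieved_ids,   # binary recall for this question
    })

recall_at_10 = sum(r["found"] for r in eval_rows) / len(eval_rows)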
Generating Diverse Synthetic Data
For truly valuable synthetic data, you need variety:
1. Variation in Question Types
Test different retrieval capabilities:
- Factual: Direct questions with explicit answers
- Inferential: Connecting multiple pieces of info
- Comparative: Comparing entities or concepts
- Hypothetical: "What if" scenarios
- Clarification: Asking for elaboration
2. Linguistic Diversity Techniques
Vary phrasing:
- Paraphrasing: Multiple ways to ask the same thing
- Terminology: Synonyms and domain language
- Complexity: Mix simple and compound queries
- Length: Short keywords to verbose questions
- Format: Questions, commands, implied questions
3. Chain-of-Thought Generation
Use CoT for nuanced questions:
Given this text chunk:
[CHUNK]
First, identify 3-5 key facts or concepts in this text.
For each key concept:
1. Think about different ways someone might ask about it
2. Consider various levels of prior knowledge the asker might have
3. Imagine different contexts in which this information might be relevant
Now, generate 5 diverse questions about this text that:
- Vary in complexity and format
- Would be asked by users with different backgrounds
- Target different aspects of the information
- Require different types of retrieval capabilities to answer correctly
For each question, explain your reasoning about why this is a realistic user question.
4. Few-Shot Prompting for Domain Specificity
If you have real user questions, use them:
I'm creating questions that users might ask about [DOMAIN].
Here are some examples of real questions:
1. [REAL QUESTION 1]
2. [REAL QUESTION 2]
3. [REAL QUESTION 3]
Here's a text passage:
[CHUNK]
Please generate 5 new questions similar in style and intent to the examples above,
that would be answered by this text passage.
5. Adversarial Question Generation
Create deliberately challenging questions:
Given this text passage:
[CHUNK]
Generate 3 challenging questions that:
1. Use different terminology than what appears in the passage
2. Require understanding the implications of the content
3. Might confuse a basic keyword search system
4. Would still be reasonable questions a user might ask
For each question, explain why it's challenging and what makes it a good test case.
Building Comprehensive Evaluation Sets
Structure your evaluation sets strategically:
- Coverage mapping: Cover all document types and topics
- Difficulty distribution: Easy (80-90% expected recall), medium (50-70%), hard (30-50%)
- Ground truth validation: Have experts verify questions and docs
- Domain segmentation: Separate sets for different use cases
- Synthetic/real blending: Gradually mix in real data as you get it
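One way to implement the "verify retrievability" step from the flywheel diagram is to grade each synthetic question against your current system and bucket it by where the source chunk lands in the ranking. The rank thresholds here are illustrative, and retrieve / eval_rows refer to the generation loop sketched earlier:
def label_difficulty(question, source_chunk_id, retrieve, k=20):
    """Bucket a synthetic question by how easily the current system finds its source chunk."""
    ranked_ids = [d["id"] for d in retrieve(question, k=k)]
    if source_chunk_id in ranked_ids[:3]:
        return "easy"        # found near the top
    if source_chunk_id in ranked_ids[:10]:
        return "medium"      # found, but buried
    return "hard"            # missed, or ranked below 10

eval_set = [
    {**row, "difficulty": label_difficulty(row["question"], row["relevant_docs"][0], retrieve)}
    for row in eval_rows   # e.g. the rows produced by the generation loop above
]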
Using Synthetic Data Beyond Evaluation
Your synthetic data can do more:
- Retrieval benchmarking: Measuring search quality
- Few-shot examples: Enhancing LLM understanding
- Fine-tuning data: Training specialized embeddings/rerankers
- User experience testing: Simulating realistic sessions
Good synthetic data upfront accelerates everything else.
Model Sensitivity Warning:
Models are more sensitive to irrelevant info than you'd expect. Even GPT-4 and Claude get distracted by marginally relevant content.
This matters for:
- Multi-step reasoning
- Numerical calculations
- Finding specific facts in specific documents
Test by:
- Creating test cases with varying irrelevant info
- Measuring performance degradation
- Using results to set precision/recall tradeoffs
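Here's a sketch of such a distractor test: hold the relevant documents fixed, mix in an increasing number of irrelevant ones, and score the answers with whatever grader you already trust. answer_question and grade_answer are stand-ins for your own generation and grading calls:
import random

def distractor_sensitivity(question, relevant_docs, distractor_pool,
                           answer_question, grade_answer, levels=(0, 3, 5, 10)):
    """Measure answer quality as irrelevant documents are mixed into the context."""
    scores = {}
    for n in levels:
        distractors = random.sample(distractor_pool, k=n)
        context = relevant_docs + distractors
        random.shuffle(context)                      # don't always put the signal first
        answer = answer_question(question, context)
        scores[n] = grade_answer(question, answer)   # e.g. exact match or an LLM judge
    return scores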
Example: One team found that adding five irrelevant documents (even when marked as "potentially less relevant") reduced financial calculation accuracy by 30%. They adjusted retrieval to favor precision for numerical queries.
Additional Resources
Tools and Libraries for RAG Evaluation:
- RAGAS: Open-source framework for evaluating RAG applications
- LangChain Evaluation: Tools for evaluating retrieval and generation
- Prompttools: Toolkit for testing and evaluating LLM applications
- MLflow for Experiment Tracking: Open-source platform for managing ML lifecycle
Reflection Questions
Take a minute to think about:
- What are your leading and lagging metrics? How do they connect?
- How could you generate more diverse and challenging synthetic questions for your domain?
- Where does your current evaluation framework fall short? What metrics would help?
- What experiment could you run this week to test an improvement hypothesis?
- How will you incorporate real user feedback as it comes in?
Conclusion and Next Steps
We've covered the foundation for systematic RAG improvement through proper evaluation. No more subjective judgments or random changes - you now have tools to measure progress objectively and make data-driven decisions.
By focusing on retrieval metrics like precision and recall, you can run more experiments faster, cheaper, and with more confidence. This sets you up for incorporating user feedback and advanced techniques.
In Chapter 2, we'll explore converting evaluations into training data for fine-tuning, creating specialized models for your specific needs.
Summary
Key principles to remember:
- Focus on leading metrics - Especially experiment velocity, which you control
- Do the obvious thing repeatedly - Success looks boring, like counting calories
- Combat absence blindness - Check retrieval quality, not just generation
- Avoid intervention bias - Make targeted changes based on hypotheses
- Embrace empiricism - No universal answers, test everything
The goal isn't chasing the latest AI techniques. It's building a flywheel of continuous improvement driven by clear metrics aligned with user outcomes. Start with synthetic data, focus on retrieval before generation, and measure everything.
As one client told me: "We spent three months trying to improve through prompt engineering and model switching. In two weeks with proper evaluations, we made more progress than all that time combined."