Converting Evaluations into Training Data for Fine-Tuning
Key Insight
If you're not fine-tuning, you're Blockbuster, not Netflix. The goal isn't to fine-tune language models (which is expensive and complex), but to fine-tune embedding models so they adapt to your specific data distribution and improve retrieval, not generation.
Learn the Complete RAG Playbook
All of this content comes from my Systematically Improving RAG Applications course. Readers get 20% off with code EBOOK. Join 500+ engineers who've transformed their RAG systems from demos to production-ready applications.
Fine-Tuning Cost Reality Check
**Embedding Model Fine-Tuning:**

- Cost: ~$1.50 for 6,000 examples
- Time: 40 minutes on a laptop
- Infrastructure: Consumer GPU or cloud notebook
- Improvement: 6-10% better performance
**Language Model Fine-Tuning:**
- Cost: $100-1000s depending on model size
- Time: Hours to days
- Infrastructure: Multiple GPUs or specialized services
- Complexity: Requires ML expertise
This dramatic difference explains why embedding fine-tuning should be your first focus.
Introduction
In the previous chapter, we set up evaluation and generated synthetic data to benchmark our RAG system. Now let's talk about how to actually use that data to improve things.
**Prerequisites from Previous Chapters:**

- Chapter 0: Understanding the improvement flywheel concept
- Chapter 1: Creating evaluation datasets with synthetic data
The evaluation examples from Chapter 1 become your training data in this chapter.
Here's the thing: the data you collect for evaluation shouldn't just sit there. Every question, every relevance judgment, every piece of feedback—it can all be used to improve your system. That's what we'll cover here.
Key Philosophy: "Every evaluation example is a potential training example. The data flywheel transforms what begins as a handful of evaluation examples into few-shot prompts, then into training datasets for fine-tuning embedding models and re-rankers."
The process is straightforward: you start with evaluation examples, turn them into few-shot prompts, then eventually use them to fine-tune your embedding models and re-rankers. Each step builds on the last.
Why Generic Embeddings Fall Short
Let me start with something that trips up a lot of teams: generic embeddings from providers like OpenAI often don't work great for specialized applications. They're good models, don't get me wrong. But they're built to handle everything, which means they don't handle your specific thing particularly well.
Limitation of Generic Models
Generic embedding models come with built-in assumptions about what "similarity" means. You don't know:
- What data they were trained on
- How they defined "success" during training
- How they weighted different types of similarity
- What trade-offs they made to work for everyone
The Elusive Nature of "Similarity"
Embedding models seem simple enough: they turn text into numbers, and similar text should end up with similar numbers. Measure the distance between vectors and you know how similar things are.
Domain-Specific Similarity Example: In e-commerce, what makes two products "similar"? Are they substitutes (different brands of red shirts) or complements (a shirt and matching pants)? Depends on what you're trying to do.
Take music recommendations. Songs might be similar because they're the same genre, or because they show up in the same playlists, or because the same people like them. If you're adding songs to a playlist, you want one kind of similarity. If you're building Spotify's Discovery Weekly, you want something else entirely.
My favorite example is from dating apps. Should "I love coffee" and "I hate coffee" be similar? Linguistically, they're opposites. But both people care about coffee enough to mention it. Maybe opposites don't attract when it comes to beverages. Or maybe they do if one person likes tea and the other likes coffee.
Here's the thing: What actually matters for a dating app is whether two people will like each other, not whether their profiles use similar words. Generic embeddings trained on web text have no idea about this.
The problem is that "similarity" means different things in different contexts. There's no universal right answer—it depends on what you're trying to do.
The Hidden Assumptions in Provider Models
When you use OpenAI or Cohere's embeddings, you're stuck with their definition of similarity. It might not match what you need.
Legal Document Search Example: One memorable case involved a legal document search application. The generic embeddings performed reasonably well for finding factual information but struggled with procedural questions. The embeddings didn't adequately capture the relationships between legal procedures and their applications—a specific type of similarity vital to legal professionals but not emphasized in general-purpose training data.
Provider embeddings aren't bad—they're great for general use. But your application probably isn't general. Fine-tuning with your own data fixes this mismatch.
From Evaluation to Few-Shot Examples
Before jumping into fine-tuning, there's something simpler you can try: few-shot examples. Let's talk about turning your evaluation data into prompts that actually work.
The Power of Examples in Context
Few-shot learning is pretty straightforward: instead of retraining the model, you just show it some examples in the prompt. No special infrastructure needed.
How Few-Shot Learning Works
When you provide a language model with examples of how to respond to similar queries, you activate its ability to recognize patterns and apply them to new inputs. It's like showing a human a few examples of a task before asking them to perform it themselves—no specialized training required, just clear demonstrations.
This works especially well for RAG because different types of questions need different approaches. Show the model a few examples and it figures out what kind of question it's dealing with.
Selecting the Right Examples
Don't just grab random examples from your evaluation set. I've watched teams do this and make their model worse. You need to pick the right examples.
Characteristics of Good Examples:
- Match the queries your users actually ask
- Show clear reasoning steps
- Cover different types of questions
- Aren't too weird or specific
Remember the synthetic data generation techniques we explored in Chapter 1? You can use those same methods to generate examples specifically for few-shot learning. The key difference is that for few-shot examples, you need not just questions and answers, but also the reasoning process that connects them.
Building Your Few-Shot Library
Build yourself a library of few-shot examples. Here's how I do it:
- Filter your evaluation data for the best examples
- Group them by type (factual questions, how-to guides, comparisons)
- Pick representative examples from each group
- Format them consistently
- Test them with your actual pipeline
- Keep the ones that work, toss the ones that don't
Structured Few-Shot Prompt Example
```
You are an assistant specialized in answering questions about [domain].

Here are some examples of how to answer questions:

Question: [Example Question 1]
Thinking: [First, I'll identify the key entities in the question. Then I'll look for information about their relationship...]
Answer: [Example Answer 1]

Question: [Example Question 2]
Thinking: [This appears to be a comparison question. I should look for information about both entities and highlight similarities and differences...]
Answer: [Example Answer 2]

Now please answer the following question:

Question: [Actual User Query]
```
Having an organized library means you can track what works, swap out examples that get stale, and keep improving based on what your users actually do.
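Here's a minimal sketch of what that library can look like in code; the dictionary structure, function name, and template wording are illustrative assumptions rather than a fixed format.

```python
# Hypothetical few-shot library keyed by query type; fill in real examples
# drawn from your evaluation data.
FEW_SHOT_LIBRARY = {
    "factual": [
        {"question": "...", "thinking": "...", "answer": "..."},
    ],
    "comparison": [
        {"question": "...", "thinking": "...", "answer": "..."},
    ],
}

def build_prompt(user_query: str, query_type: str, domain: str = "your domain") -> str:
    """Assemble a few-shot prompt from the library for a given query type."""
    examples = FEW_SHOT_LIBRARY.get(query_type, [])
    blocks = [
        f"Question: {ex['question']}\nThinking: {ex['thinking']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    return (
        f"You are an assistant specialized in answering questions about {domain}.\n\n"
        "Here are some examples of how to answer questions:\n\n"
        + "\n\n".join(blocks)
        + f"\n\nNow please answer the following question:\n\nQuestion: {user_query}"
    )
```

Keying the selection by query type makes it easy to swap in better examples for each group as your evaluation data grows.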
Practical Implementation: Building the Data Flywheel for Fine-Tuning
Few-shot examples are great, but fine-tuning your embeddings is where you see real improvements. The trick is getting enough good data to make it worth doing.
Starting the Flywheel
Building a RAG system is iterative. You start with a few examples for evaluation. Those become few-shot prompts. Eventually you have enough for fine-tuning. Each stage builds on the last.
**Data Collection Milestones:**

- With 20 examples, you can build basic evaluation benchmarks
- With 30 examples, you can create effective few-shot prompts
- With 1000+ examples, you can fine-tune your retrieval models
What's nice is you're not throwing away data—you're using it differently as you get more of it.
Start Collecting Now
You need to start collecting the right data now, even if you're not ready to fine-tune yet. The sooner you start logging relevant user interactions, the sooner you'll reach the critical mass needed for fine-tuning.
What Data Should You Log?
For RAG, here's what you should be logging:
- What users actually ask
- Which chunks got used in responses
- Which responses users liked (or didn't)
- Which queries needed follow-ups
This tells you what's actually relevant to what—which is exactly what you need for fine-tuning.
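A logging setup doesn't need to be fancy. Here's a minimal sketch that appends one JSON record per query to a file; the JSONL storage and field names are assumptions, so point it at whatever backend your team already uses.

```python
import json
import time
import uuid

def log_retrieval(query, retrieved_chunks, cited_chunk_ids, user_feedback=None,
                  path="retrieval_log.jsonl"):
    """Append one record per query with what was retrieved, cited, and rated."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        # keep the top-ranked chunks, not just the ones that were used
        "retrieved": [
            {"id": c["id"], "text": c["text"], "score": c["score"]}
            for c in retrieved_chunks
        ],
        "cited_ids": cited_chunk_ids,   # chunks actually used in the answer
        "feedback": user_feedback,      # e.g. "thumbs_up", "thumbs_down", or None
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```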
Domain-Specific Relevance Signals
For other applications, the relevance signals will differ:
- In e-commerce: track which items are purchased together, viewed in sequence, or added to the same lists
- For music recommendations: log which songs appear in the same playlists or are hearted by the same users
- For dating apps: record which profiles match and go on to have meaningful conversations
The key is defining what "relevance" means in your specific context and systematically collecting data that captures this relationship.
Start Logging Yesterday!
I've seen numerous companies hire machine learning engineers to fine-tune embedding models, only to realize they hadn't started logging relevance data. These teams then have to wait 3-6 months to collect enough data before they can begin the work they intended to do immediately.
The most important action you can take today is to start logging relevance data, even if you're not ready to hire ML specialists or begin fine-tuning. Save the top 20-40 chunks for each query and use an LLM to mark relevance if human annotation isn't feasible. This data will be invaluable when you're ready to improve your models.
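If human annotation isn't feasible, LLM-as-judge labeling can be as simple as the sketch below. The model name, prompt wording, and YES/NO convention are assumptions, and you should spot-check a sample of the labels by hand.

```python
from openai import OpenAI

client = OpenAI()

def judge_relevance(query: str, chunk: str, model: str = "gpt-4o-mini") -> bool:
    """Ask an LLM whether a retrieved chunk is relevant to a query."""
    prompt = (
        f"Query: {query}\n\nDocument chunk: {chunk}\n\n"
        "Is this chunk relevant to answering the query? Reply with only YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```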
I watched a team build a great RAG app for internal docs. Six months later they wanted to fine-tune embeddings but had zero data because they never set up logging. Had to start from scratch with synthetic data. Don't do this.
Small Datasets Can Make Big Differences
The team at Sentence Transformers has demonstrated that even with just 6,000 examples, you can achieve 6-10% better performance. With 40 minutes of fine-tuning on a laptop, you can create significant lifetime value for your application. This makes fine-tuning embedding models accessible even to teams without massive datasets or specialized infrastructure.
Understanding Contrastive Learning for Embeddings
Let's talk about how fine-tuning actually works. Most approaches use something called contrastive learning.
Learning Through Contrasts
Contrastive learning is simple: you teach the model what's similar by showing it what's different. It's all about relationships, not absolute values.
Triplet Structure
The most common implementation uses a structure called a triplet, which consists of:
- An anchor (usually the query)
- A positive example (a document that's relevant to the query)
- A negative example (a document that's not relevant to the query)
The goal of training is straightforward: adjust the embedding model so that the distance between the anchor and positive example decreases, while the distance between the anchor and negative example increases. In other words, pull similar things closer together and push dissimilar things further apart.
```mermaid
graph LR
    A[Anchor: Query] --- P[Positive: Relevant Document]
    A --- N[Negative: Irrelevant Document]
    P -.-|"Pull Closer"| A
    N -.-|"Push Away"| A
```
This works great for embeddings because you're directly optimizing the distance relationships that matter for retrieval.
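In practice, this is a short script with the Sentence Transformers library. Here's a rough sketch; the base model name, batch size, and epoch count are placeholder assumptions, not recommendations.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a small open-source base model (name is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each InputExample holds one (anchor, positive, negative) triplet.
train_examples = [
    InputExample(texts=[
        "user query text",                     # anchor
        "a chunk that answered it",            # positive
        "a retrieved chunk that wasn't used",  # negative
    ]),
    # ... thousands more triplets built from your logs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model=model)  # pull positives closer, push negatives away

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("embeddings-finetuned")
```

`TripletLoss` directly optimizes the pull-closer/push-away objective from the diagram above; `MultipleNegativesRankingLoss` is a common alternative when you only have query-positive pairs.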
### Creating Effective Triplets for RAG
For RAG applications, there are several natural ways to create triplet datasets:
- **Anchor**: The user's query
- **Positive**: Document chunks that were cited in the final response or received positive feedback
- **Negative**: Document chunks that were retrieved but not cited, or received negative feedback
### Healthcare RAG Triplet Example
Imagine a healthcare RAG application where a user asks: "What are the side effects of medication X?"

Our retrieval system might return several documents, including:

Document A: "Medication X may cause drowsiness, nausea, and in rare cases, allergic reactions."

Document B: "Medication X is used to treat high blood pressure and should be taken with food."
If Document A is cited in the response while Document B isn't, we can create a triplet:
```json
{
  "anchor": "What are the side effects of medication X?",
  "positive": "Medication X may cause drowsiness, nausea, and in rare cases, allergic reactions.",
  "negative": "Medication X is used to treat high blood pressure and should be taken with food."
}
```
Through many such examples, the model learns that queries about side effects should be closer to texts describing adverse reactions than to texts describing indications or administration instructions.
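Turning your logs into triplets can be as simple as pairing every cited chunk with every uncited chunk for the same query. The hypothetical helper below assumes a log schema with `query`, `retrieved`, and `cited_ids` fields; adjust it to whatever you actually record.

```python
def build_triplets(logged_interactions):
    """Pair each cited chunk (positive) with each uncited chunk (negative) per query."""
    triplets = []
    for record in logged_interactions:
        cited = [c for c in record["retrieved"] if c["id"] in record["cited_ids"]]
        uncited = [c for c in record["retrieved"] if c["id"] not in record["cited_ids"]]
        for pos in cited:
            for neg in uncited:
                triplets.append({
                    "anchor": record["query"],
                    "positive": pos["text"],
                    "negative": neg["text"],
                })
    return triplets
```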
The Challenge of Hard Negatives and How UX Can Help
Notice something subtle in that example? The negative document is still about the same medication—just not about side effects. That's a "hard negative": similar in some ways, different in the ways that matter.
Hard Negative Mining Strategies
**Effective Approaches:**

- **Semantic similarity with different intent**: "Software engineer" vs "Software engineering recruiter" are both about software roles, but they serve different user needs
- **User deletion signals**: documents users actively remove from results are perfect hard negatives, retrieved but explicitly rejected
- **Category boundaries**: items from adjacent but different categories, such as "Red running shoes" vs "Red dress shoes"
- **Temporal relevance**: outdated versions of correct information, such as "2023 tax rates" when the user needs "2024 tax rates"
Agentic Retrieval Perspective
Colin Flaherty's work on agentic coding systems reveals a surprising insight: "We found that for SWE-bench tasks, embedding-based retrieval was not the bottleneck - grep and find were sufficient." The agent's persistence effectively compensated for less sophisticated tools. This suggests that while fine-tuning embeddings is valuable, the agent layer can sometimes overcome retrieval limitations through persistence. Learn more about agentic approaches →
Value of Hard Negatives
Hard negatives are way more valuable than easy ones. If your negative example was about car maintenance instead of medications, the model learns nothing—it already knows car maintenance isn't relevant to medication side effects.
The real challenge is teaching the model to distinguish between different aspects of the same topic. That's where you get actual improvements.
That's why hard negative mining matters—finding examples that are tricky but teachable.
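One practical way to mine hard negatives is to query your existing embedding model and keep the top-ranked chunks that were never cited or marked relevant. A rough sketch, with the model name and data shapes as assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # your current embedding model

def mine_hard_negatives(query, positive_ids, corpus_ids, corpus_texts, top_k=20):
    """Return chunk ids that rank highly for the query but were never marked relevant."""
    query_emb = model.encode(query, convert_to_tensor=True)
    corpus_emb = model.encode(corpus_texts, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [
        corpus_ids[hit["corpus_id"]]
        for hit in hits
        if corpus_ids[hit["corpus_id"]] not in positive_ids
    ]
```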
Designing UX for Better Training Data
If you're serious about improving your embeddings, consider explicitly designing your UX to capture these signals:
- **Document-level feedback mechanisms**: Add simple thumbs up/down options next to each retrieved document, not just for the final answer
- **Click tracking**: Record which documents users click on and which they ignore—those ignored despite ranking highly are excellent hard negative candidates
- **Dwell time analysis**: If a user quickly returns from a document without spending time reading it, that's a strong signal it wasn't relevant
- **Explicit comparison interfaces**: For critical applications, consider interfaces that ask users to compare documents and select the most relevant one
- **Query reformulation tracking**: When a user modifies their query slightly and gets better results, you can pair the original query with documents from the improved results to create training pairs
One team I worked with added a "more like this" button next to helpful documents. Users loved it, and it gave us perfect training data about what users actually consider similar—which often wasn't what we expected from just reading the text.
The Power of Re-Rankers in RAG Systems
Embeddings do the heavy lifting in retrieval, but re-rankers add polish. The difference: embeddings process queries and documents separately, while re-rankers look at them together and can make smarter decisions.
Bi-Encoders vs. Cross-Encoders: Understanding the Trade-offs
Here's the trade-off: embeddings are fast, re-rankers are accurate.
Model Comparison
**Bi-encoders (embedding models):**

- Encode query and document independently
- Allow pre-computation of document embeddings
- Enable fast vector similarity operations
- Work well for first-pass retrieval of candidates
- Examples include OpenAI's text-embedding models, SBERT, MPNet

**Cross-encoders (re-rankers):**

- Process query and document together as a pair
- Cannot pre-compute relevance scores
- Provide more accurate relevance judgments
- Work best for re-ranking a smaller set of candidates
- Examples include Cohere Rerank, monoT5
Use them together: embeddings grab candidates quickly, re-ranker sorts them properly.
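Here's a sketch of that two-stage pipeline using off-the-shelf models from the Sentence Transformers ecosystem; the model names and candidate counts are assumptions you'd tune for your own system.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")              # fast first pass
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # accurate second pass

def retrieve_and_rerank(query, documents, n_candidates=50, final_k=10):
    # Stage 1: bi-encoder pulls a broad candidate set with vector similarity
    doc_emb = bi_encoder.encode(documents, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_emb, top_k=n_candidates)[0]
    candidates = [documents[hit["corpus_id"]] for hit in hits]

    # Stage 2: cross-encoder scores each (query, document) pair jointly
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:final_k]
```

Because document embeddings can be precomputed, the expensive cross-encoder only ever sees a few dozen candidates per query.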
Re-Ranker Success Story: One team I worked with was debating whether to invest in fine-tuning their embeddings or implementing a re-ranker. When they tested both approaches, they found that fine-tuning embeddings improved recall from 65% to 78%, while adding a re-ranker (even without fine-tuning) improved it to 82%. Combining both approaches pushed performance to 91%—a transformative improvement from where they started.
Creating Training Data for Re-Rankers
Re-rankers work better with graded relevance scores instead of just yes/no labels. Try this:
- Score query-document pairs on a scale (like 0-5)
- Include the full range of scores
- Train the model to predict these scores
Graded Relevance Example
```json
{
  "query": "How do I reset my password?",
  "documents": [
    {"text": "Step-by-step password reset guide", "score": 5},
    {"text": "General account management information", "score": 3},
    {"text": "Creating a strong password", "score": 2},
    {"text": "About our company", "score": 0}
  ]
}
```
This helps the re-ranker understand degrees of relevance, not just binary yes/no. Users notice the difference.
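With the Sentence Transformers `CrossEncoder` class, training on graded scores can look roughly like the sketch below; scores are scaled to 0-1, and the base model name and hyperparameters are illustrative assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# One InputExample per (query, document, score) row, with labels scaled to 0-1.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Step-by-step password reset guide"], label=5 / 5),
    InputExample(texts=["How do I reset my password?",
                        "Creating a strong password"], label=2 / 5),
    InputExample(texts=["How do I reset my password?",
                        "About our company"], label=0 / 5),
    # ... the rest of your graded dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)
model.save("reranker-finetuned")
```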
Practical Fine-Tuning Workflow
Here's a workflow that actually works, based on what I've seen teams do successfully.
When to Fine-Tune Embeddings
Fine-tune your embedding models when:
- You have 6,000+ query-document pairs with relevance labels
- Domain-specific terminology isn't well-represented in generic models
- Your definition of similarity differs from general language understanding
- Cost at scale justifies maintaining your own infrastructure
Production Insight
From office hours: "With just 6,000 examples from your domain, you can train embedding models and cross-encoders that outperform general-purpose models on your specific tasks. This typically costs around $1.50 and takes about 40 minutes on a laptop."
The Fine-Tuning Process
Step 1: Data Preparation
Transform your evaluation data into training format:
- Positive pairs: Query-document combinations that should rank highly
- Hard negatives: Similar but incorrect documents for each query
- Validation set: Hold out 20% for testing improvements
Critical Success Factor
The quality of your hard negatives determines the quality of your fine-tuned model. Documents that are topically similar but serve different intents make the best hard negatives.
Step 2: Model Selection
Choose your base model wisely:
- For English-only: Modern BERT models with 8,000 token context (vs original 512)
- For multilingual: Cohere's multilingual models or mE5
- For specialized domains: Start with models pre-trained on similar content
Step 3: Training Infrastructure
You don't need massive infrastructure:
- Local training: Consumer GPU with 8GB+ VRAM
- Cloud notebooks: Colab Pro or similar services
- Training time: 30-60 minutes for most datasets
- Cost: Under $5 for most use cases
Measuring Success
Track these metrics before and after fine-tuning (a quick sketch for computing the first two follows the list):
- Recall@K at different values (5, 10, 20)
- Mean Reciprocal Rank (MRR)
- Business metrics tied to retrieval quality
- Latency impact if self-hosting
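Here's a small sketch for computing Recall@K and MRR over an evaluation set; it assumes `results` maps each query to its ranked list of retrieved chunk IDs and `relevant` maps each query to the set of gold IDs (both names are illustrative).

```python
def recall_at_k(results, relevant, k=10):
    """Average fraction of gold chunks found in the top-k results per query."""
    total = sum(
        len(set(ranked[:k]) & relevant[q]) / max(len(relevant[q]), 1)
        for q, ranked in results.items()
    )
    return total / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant chunk per query."""
    total = 0.0
    for q, ranked in results.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)
```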
Real-World Results
A healthcare company fine-tuned embeddings on medical abbreviations where generic models confused similar acronyms. Results:

- Recall@10 improved from 72% to 89%
- Reduced confusion between similar medical terms
- Cost: $1.50 in compute, 45 minutes of training
- ROI: Prevented multiple medical documentation errors
Common Pitfalls to Avoid
- Training on too little data: Wait until you have at least 6,000 examples
- Ignoring hard negatives: Easy negatives don't improve the model
- Not validating on real queries: Synthetic data alone isn't sufficient
- Over-optimizing on metrics: Ensure improvements translate to user experience
Resources for Implementation
For detailed implementation guides:
- Sentence Transformers Training Documentation
- Cohere's Fine-tuning Guide
- OpenAI's Fine-tuning Best Practices
Key Takeaway
"It's probably a bad idea to train your own language model, but it's a very good idea to train your own embedding model. The infrastructure requirements are minimal, the process is well-understood, and the improvements are substantial for domain-specific applications."
Testing Different Approaches Systematically
Now that you have evaluation data from Chapter 1, you can start testing different approaches. You need to be systematic about this.
Good Experimentation Practices
For each experiment:
- Have a clear hypothesis ("Re-ranker will improve recall@10 by 15%")
- Define success before you start
- Change one thing at a time
- Measure using your established metrics
- Document what worked AND what didn't
Experiments worth trying:
- Different embedding models (OpenAI vs Cohere vs open-source)
- Chunk sizes and overlaps
- Lexical vs semantic vs hybrid retrieval
- Adding a re-ranker
- Different few-shot examples
Debate Resolved Through Data
One team I worked with spent weeks debating which embedding model to use, with different team members advocating for their preferred option. Instead of continuing the debate, they implemented a simple experiment: they indexed their documents with three different embedding models and measured recall on their evaluation set. The results settled the debate in hours, not weeks, and the team moved forward with data-backed confidence.
Keep running experiments, measuring results, and iterating. That's how you get better.
Building a Roadmap for Continuous Improvement
Once you've run some experiments, you can plan what to do next. It's not just about adding features—it's about having a process.
Prioritization Framework
When deciding what to work on:
- Impact: What will move the needle most?
- Effort: How much work is it?
- Dependencies: What needs to be done first?
- Risk: What could break?
I like using a simple impact/effort grid. High impact, low effort goes first—those quick wins build momentum while you're working on the harder stuff.
Prioritization in Action
In one project, we identified that implementing BM25 hybrid retrieval would be high-impact and medium-effort, while fine-tuning custom embeddings would be high-impact but high-effort. We prioritized the hybrid retrieval first, which gave us immediate gains while we collected data for the eventual embedding fine-tuning.
Linear Adapters: A Cost-Effective Alternative
Before diving into full fine-tuning, consider linear adapters - a technique that can deliver significant improvements at a fraction of the cost.
What Are Linear Adapters?
Linear adapters add a small trainable layer on top of frozen embeddings:

- Train only a linear transformation matrix
- Keep the base embedding model unchanged
- Combine benefits of domain specificity with pre-trained knowledge
**Cost Comparison:**
- Full fine-tuning: $50-100 for meaningful datasets
- Linear adapters: ~$12 for the same improvement
- Training time: Minutes vs hours
When to Use Linear Adapters
Perfect for:

- Domain-specific terminology mapping
- Multi-domain applications (train separate adapters)
- Rapid experimentation
- Limited computational resources
**Implementation:**
```python
# Simplified example: `model` is the frozen base embedding model and
# `linear_adapter` is a small trained linear layer (see the sketch below)
base_embeddings = model.encode(texts)                   # unchanged base vectors
adapted_embeddings = linear_adapter(base_embeddings)    # domain-adapted vectors
```
You can train different adapters for different query types or domains, switching between them based on query classification.
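Here's a minimal sketch of training such an adapter with a frozen base model and a triplet objective; the base model name, margin, learning rate, and the tiny inline dataset are all assumptions, and in practice you'd batch the encoding and train on thousands of triplets.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Frozen base model kept on CPU for simplicity.
base = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
dim = base.get_sentence_embedding_dimension()

adapter = nn.Linear(dim, dim, bias=False)          # the only trainable piece
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)
loss_fn = nn.TripletMarginLoss(margin=0.2)

# Illustrative triplets; in practice these come from your logs.
triplets = [
    ("user query text", "a chunk that answered it", "a retrieved chunk that wasn't used"),
]

for query, positive, negative in triplets:
    with torch.no_grad():  # base embeddings stay frozen
        q, p, n = (base.encode(t, convert_to_tensor=True).unsqueeze(0)
                   for t in (query, positive, negative))
    loss = loss_fn(adapter(q), adapter(p), adapter(n))  # only the adapter learns
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the adapter's weights change, you can keep one adapter per domain or query type and switch between them at query time, as described above.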
Additional Resources
Tools and Libraries
### Understanding Embedding Models
1. **Sentence Transformers Library** ([https://www.sbert.net/](https://www.sbert.net/)): This library provides easy-to-use implementations for state-of-the-art embedding models, supporting both pairwise datasets and triplets for fine-tuning. It's my recommended starting point for most teams due to its balance of performance and ease of use.
2. **Modern BERT** ([https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)): These newer models offer 8,000 token sequence lengths and generally outperform classic BERT-based models. The BGE models in particular have shown excellent performance across many domains and are worth testing in your applications.
3. **Cohere Re-ranking Models** ([https://cohere.com/rerank](https://cohere.com/rerank)): Cohere offers state-of-the-art re-ranking capabilities with a fine-tuning API that makes it relatively easy to customize for your specific needs. In my experience, even their base re-ranker without fine-tuning often provides substantial improvements to retrieval quality.
4. **Specialized Domains**: For specific domains like code, science, or legal documents, look for models pre-trained on related corpora. For example, CodeBERT for programming or SciBERT for scientific literature can provide better starting points than general models.
5. **Comparison to Data Labeling**: Everything we're doing today with fine-tuning embedding models is what I used to pay data labeling teams hundreds of thousands of dollars to do annually. The ML playbook that was once only accessible to large companies with significant budgets is now available to teams of all sizes thanks to advances in transfer learning and fine-tuning techniques.
Key Concepts
#### Contrastive Learning In-Depth
Contrastive learning trains models to recognize similarities and differences between items by pushing and pulling examples in the embedding space:
- **Triplet Loss**: Optimizes the distance between anchor-positive pairs relative to anchor-negative pairs
- **InfoNCE Loss**: Contrasts a positive pair against multiple negative examples
- **Multiple Negatives Ranking Loss**: Handles batches of queries with multiple negatives per query
#### Scaling and Efficiency Considerations
For large datasets or production workloads:
- Consider parallel processing frameworks (like Modal) to accelerate embedding and training
- Experiment with multi-GPU training for faster iterations
- Evaluate the trade-offs between API costs and self-hosting
- Test multiple model variations simultaneously to find optimal configurations
Reflection Questions
!!! question "Self-Assessment"

    1. What specific definition of "similarity" is most important for your application's domain?
    2. How would you create effective few-shot examples from your existing evaluation data?
    3. What user interactions in your application could provide valuable training signals for fine-tuning?
    4. If you had to prioritize one retrieval improvement for your system, would it be embeddings, re-ranking, or something else? Why?
    5. What experiments could you run to test your hypotheses about improving retrieval quality?
Conclusion and Next Steps
We covered a lot:
- Turning evaluation examples into few-shot prompts
- Why generic embeddings often aren't good enough
- Building datasets for fine-tuning
- How contrastive learning works
- Running systematic experiments
- Planning improvements
The main takeaway: don't waste your data. Every question, every bit of feedback, every evaluation—it can all make your system better if you capture and use it.
Fine-tuning embeddings really works, and unlike fine-tuning LLMs, it's actually doable. You can see real improvements with just 6,000 examples.
What's Coming Next
In Chapter 3, we'll dive into deployment strategies, user feedback collection methods, and how to use this feedback to further refine your RAG application. We'll explore practical techniques for gathering implicit and explicit feedback, designing effective user interfaces, and closing the loop between user interactions and system improvements.
!!! info "Related Concepts in Other Chapters"

    - Query Segmentation (Chapter 4): Learn how to identify which queries benefit most from fine-tuning
    - Specialized Models (Chapter 5): See how fine-tuned embeddings power specialized retrievers
    - Router Optimization (Chapter 6): Understand how fine-tuning improves query routing
Summary
Start with evaluation, add few-shot examples, work up to fine-tuning. Each step makes the next one better.
Do these things now:
- Start logging data - Even if you're not ready to use it
- Define what "similar" means for you - It's different for every application
- Add feedback mechanisms - You'll need the data later
- Build your few-shot library - Start small, grow over time
- Try domain-specific models - They might already solve your problem
If you do this right, every piece of data makes your system better. The improvements compound over time and affect everything—clustering, topic modeling, all of it.