
Performance Measurement and Improvement: Building Learning Systems

Key Insight

Measure both retrieval AND routing—a perfect retriever is useless if the router never calls it. Your system's performance is the product of routing accuracy and retrieval quality. Track tool selection precision (did we pick the right tool?), retrieval recall (did the tool find the answer?), and end-to-end success. The compound effect means 90% routing × 90% retrieval = 81% overall success.

Learn the Complete RAG Playbook

All of this content comes from my Systematically Improving RAG Applications course. Readers get 20% off with code EBOOK. Join 500+ engineers who've transformed their RAG systems from demos to production-ready applications.

Introduction

This part explores how to measure, test, and continuously improve a unified RAG system:

  • Testing and measuring performance of both retrieval and routing components
  • Creating user interfaces that leverage both AI and direct tool access
  • Building systems that scale across teams and complexity levels
  • Creating continuous improvement cycles through user feedback

Testing Query Routing Effectiveness

Just as we need metrics for retrieval quality, we need metrics for routing quality. The fundamental question is: are we selecting the right tools for each query?

Tool Selection Metrics

To evaluate tool selection, we need a test dataset with queries annotated with the correct tool(s) to use. From there, we can calculate:

  1. Tool Precision: When we select a tool, how often is it actually the right one?
  2. Tool Recall: Of the tools that should have been selected, what fraction did we actually select?
  3. Tool F1 Score: The harmonic mean of precision and recall
  4. Per-Tool Recall: How often each specific tool is correctly selected when it should be
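
Concretely, for a single query these metrics reduce to set arithmetic over the expected and selected tool sets. A minimal sketch with illustrative values:

# Per-query tool selection metrics as set arithmetic (illustrative values)
expected = {"SearchBlueprint", "SearchSchedule"}    # annotated ground truth
selected = {"SearchBlueprint", "SearchText"}        # router's prediction

correct = expected & selected                       # {"SearchBlueprint"}
precision = len(correct) / len(selected)            # 1/2 = 0.5
recall = len(correct) / len(expected)               # 1/2 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.5

Per-tool recall aggregates these per-query counts for each tool, as the evaluation code later in this section does.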

Data Leakage Risk

When creating test datasets for router evaluation, be vigilant about data leakage. If your few-shot examples appear in your test set, you'll get artificially high performance that won't generalize to real queries. Always maintain separate development and test sets with distinct query patterns.

Here's a sample evaluation for a construction information system's query router:

| Query ID | Query Text | Expected Tools | Selected Tools | Precision | Recall |
|---|---|---|---|---|---|
| 1 | Retrieve blueprints for the museum expansion | SearchBlueprint | SearchBlueprint | 100% | 1/1 |
| 2 | Find schedule and documents for the library renovation | SearchSchedule, SearchText | SearchSchedule | 100% | 1/2 |
| 3 | Get both blueprints and schedule for campus construction | SearchBlueprint, SearchSchedule | SearchBlueprint, SearchSchedule | 100% | 2/2 |
| 4 | Show me contract details and permit requirements for the new office | SearchText, SearchBlueprint | SearchText, SearchBlueprint, SearchSchedule | 67% | 2/2 |
| 5 | Identify materials and design specs for the downtown skyscraper | SearchText, SearchBlueprint | SearchBlueprint, SearchText | 100% | 2/2 |
| 6 | Get full details on industrial park planning | SearchBlueprint, SearchText, SearchSchedule | SearchText, SearchInvoice, SearchPermit | 33% | 1/3 |
| 7 | Find emergency repair guidelines for the abandoned warehouse | SearchRepair, SearchBlueprint | SearchText | 0% | 0/2 |
| 8 | Obtain comprehensive analysis for the urban redevelopment project | SearchBlueprint, SearchText, SearchSchedule, SearchPermit | SearchBlueprint | 100% | 1/4 |
| 9 | Explain zoning regulations for the new industrial area | SearchZoning | SearchBlueprint, SearchText | 0% | 0/1 |

Looking at overall metrics, this system achieves:

  • Average Precision: 67%
  • Average Recall: 56%
  • Average F1 Score: 61% (the harmonic mean of the average precision and recall)

These aggregate metrics are useful, but they don't tell the complete story. What's often more revealing is the per-tool recall:

| Tool | Times Expected | Times Selected Correctly | Per-Tool Recall |
|---|---|---|---|
| SearchBlueprint | 7 | 5 | 71% |
| SearchText | 5 | 3 | 60% |
| SearchSchedule | 4 | 2 | 50% |
| SearchPermit | 1 | 0 | 0% |
| SearchZoning | 1 | 0 | 0% |
| SearchRepair | 1 | 0 | 0% |

This breakdown shows that less common tools (Permit, Zoning, Repair) have extremely low recall, suggesting that our router doesn't have enough examples of these tools to recognize when they should be used.

Automating Router Evaluation

Here's a code example for evaluating router performance:

def evaluate_router(router_function, test_dataset):
    """
    Evaluate a routing function against a test dataset.

    Args:
        router_function: Function that takes a query and returns tool selections
        test_dataset: List of {query, expected_tools} pairs

    Returns:
        Dictionary of evaluation metrics
    """
    results = []
    tool_expected_count = {}
    tool_selected_count = {}
    tool_correct_count = {}

    for test_case in test_dataset:
        query = test_case["query"]
        expected_tools = set(test_case["expected_tools"])

        # Track expected tools
        for tool in expected_tools:
            tool_expected_count[tool] = tool_expected_count.get(tool, 0) + 1

        # Get router predictions
        selected_tools = set(router_function(query))

        # Track selected tools
        for tool in selected_tools:
            tool_selected_count[tool] = tool_selected_count.get(tool, 0) + 1

        # Calculate precision and recall for this query
        correct_tools = expected_tools.intersection(selected_tools)
        for tool in correct_tools:
            tool_correct_count[tool] = tool_correct_count.get(tool, 0) + 1

        precision = len(correct_tools) / len(selected_tools) if selected_tools else 1.0
        recall = len(correct_tools) / len(expected_tools) if expected_tools else 1.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        results.append({
            "query": query,
            "expected_tools": expected_tools,
            "selected_tools": selected_tools,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })

    # Calculate overall metrics
    avg_precision = sum(r["precision"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)
    avg_f1 = sum(r["f1"] for r in results) / len(results)

    # Calculate per-tool recall
    per_tool_recall = {}
    for tool in tool_expected_count:
        if tool_expected_count[tool] > 0:
            per_tool_recall[tool] = tool_correct_count.get(tool, 0) / tool_expected_count[tool]
        else:
            per_tool_recall[tool] = 0

    return {
        "detailed_results": results,
        "avg_precision": avg_precision,
        "avg_recall": avg_recall,
        "avg_f1": avg_f1,
        "per_tool_recall": per_tool_recall,
        "tool_expected_count": tool_expected_count,
        "tool_selected_count": tool_selected_count,
        "tool_correct_count": tool_correct_count
    }
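
A quick usage sketch; the keyword router here is a toy stand-in for your real routing function:

def keyword_router(query):
    """Toy router: selects tools by keyword matching."""
    tools = []
    if "blueprint" in query.lower():
        tools.append("SearchBlueprint")
    if "schedule" in query.lower():
        tools.append("SearchSchedule")
    return tools or ["SearchText"]

test_dataset = [
    {"query": "Retrieve blueprints for the museum expansion",
     "expected_tools": ["SearchBlueprint"]},
    {"query": "Find schedule and documents for the library renovation",
     "expected_tools": ["SearchSchedule", "SearchText"]},
]

metrics = evaluate_router(keyword_router, test_dataset)
print(f"Precision: {metrics['avg_precision']:.0%}, Recall: {metrics['avg_recall']:.0%}")
print(metrics["per_tool_recall"])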

Analyzing Tool Selection Failures

When tool selection fails, we need to understand why. A confusion matrix is particularly useful here, showing which tools are being confused with one another.

For example, if we find that the SearchBlueprint tool is never being selected even when it should be, we might need to improve its description or add more examples to the system prompt.

Confusion Matrix Analysis

Imagine our evaluation produces this confusion matrix:

| Expected \ Selected | SearchText | SearchBlueprint | SearchSchedule |
|---|---|---|---|
| SearchText | 85 | 5 | 10 |
| SearchBlueprint | 40 | 50 | 10 |
| SearchSchedule | 15 | 5 | 80 |

This shows that SearchBlueprint is frequently mistaken for SearchText, indicating that we need to better differentiate these tools.
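
Multi-label routing has no single canonical confusion matrix, so one simple approximation is to count every (expected tool, selected tool) pair per query; off-diagonal counts then reveal substitutions. A sketch that consumes the detailed_results produced by evaluate_router above:

from collections import defaultdict

def build_confusion_matrix(detailed_results):
    """Count (expected tool, selected tool) co-occurrences per query."""
    matrix = defaultdict(lambda: defaultdict(int))
    for result in detailed_results:
        for expected in result["expected_tools"]:
            for selected in result["selected_tools"]:
                matrix[expected][selected] += 1
    return matrix

# e.g., matrix["SearchBlueprint"]["SearchText"] counts how often SearchText
# was selected on queries where SearchBlueprint was expected.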

Targeted Improvement Strategy

Once you've identified specific weaknesses in your router, you can implement targeted improvements:

  1. For low-recall tools:
     • Add more few-shot examples for these tools
     • Improve tool descriptions to more clearly differentiate them
     • Consider whether these tools are truly distinct or should be merged

  2. For commonly confused tools:
     • Analyze failure cases to understand what's causing the confusion
     • Create "contrast examples" that explicitly show why similar queries go to different tools
     • Refine tool interfaces to have clearer boundaries

  3. For overall improvement:
     • Balance your few-shot examples across all tools
     • Include edge cases that test the boundaries between tools
     • Add multi-tool examples that show when multiple tools should be used together

Synthetic Data Generation for Router Testing

You can use synthetic data techniques to create comprehensive test cases for your router:

  1. Start with clear definitions of each tool's purpose
  2. Use an LLM to generate diverse queries that should trigger each tool
  3. Include variants of each query with slightly different wording
  4. Generate ambiguous queries that could reasonably go to multiple tools
  5. Create a balanced dataset that covers all tools proportionally

This approach ensures comprehensive coverage of your router's decision space without requiring extensive manual labeling.
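
Here's a sketch of steps 1-3; call_llm is a placeholder for whatever LLM client you use, and the prompt wording is illustrative:

def generate_router_test_cases(tool_descriptions, queries_per_tool=10):
    """Generate single-tool router test cases from tool definitions."""
    test_cases = []
    for tool_name, description in tool_descriptions.items():
        prompt = (
            f"Tool: {tool_name}\nPurpose: {description}\n\n"
            f"Write {queries_per_tool} diverse user queries this tool should "
            "handle, one per line. Vary wording, specificity, and length."
        )
        response = call_llm(prompt)  # placeholder for your LLM client
        for line in response.strip().split("\n"):
            if line.strip():
                test_cases.append({"query": line.strip(),
                                   "expected_tools": [tool_name]})
    return test_cases

Ambiguous and multi-tool cases (steps 4 and 5) need a second pass that prompts with several tool definitions at once and asks for queries spanning them.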

User Interfaces: Direct Tool Access

One powerful insight from the routing architecture is that tools designed for language models can often be exposed directly to users as well. Just as Google offers specialized interfaces like Google Maps, YouTube, and Google Images alongside its main search, your RAG application can offer both:

  1. A natural language interface using the router
  2. Direct access to specialized tools for specific needs

"When I know exactly what I need, a specialized tool is much faster than explaining it to a chatbot. But when I'm exploring new areas or have complex needs, the chat interface helps me discover what's possible."

— Expert User Perspective

Dual-Mode UI Example

Imagine a construction information system that offers:

  • A chat interface for general questions
  • A blueprint search interface with date filters
  • A document search interface with type filters
  • A schedule search with timeline visualization
  • A permit lookup tool with status tracking

These specialized interfaces map directly to the specialized retrievers we've built.

This dual-mode interface has several advantages:

  1. Expert users can go directly to the tool they need
  2. New users can use natural language until they learn the system
  3. User interactions with direct tools provide training data for routing
  4. Clear capabilities help users understand what the system can do
  5. Control and transparency give users confidence in the results
  6. Performance optimization for common, well-defined tasks

UI Implementation Strategy

When implementing a dual-mode interface:

  1. Design specialized interfaces that match your existing tools' parameters
  2. Create a unified entry point that offers both chat and specialized tool options
  3. Add suggestions in chat responses that link to relevant specialized tools (see the sketch after this list)
  4. Maintain consistent terminology between chat responses and tool interfaces
  5. Track which interface users prefer for different query types
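
For example, the suggestion links in step 3 might look like the following sketch; the routes and query parameter are assumptions that should match your actual specialized interfaces:

from urllib.parse import urlencode

# Illustrative mapping from router tool names to specialized interfaces
TOOL_ROUTES = {
    "SearchBlueprint": "/search/blueprints",
    "SearchText": "/search/documents",
    "SearchSchedule": "/search/schedule",
}

def suggest_tool_links(selected_tools, query):
    """Build 'open in specialized tool' links to attach to a chat response."""
    links = []
    for tool in selected_tools:
        route = TOOL_ROUTES.get(tool)
        if route:
            links.append(f"{route}?{urlencode({'query': query})}")
    return links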

Specialized Interface Examples

Here's how specialized interfaces might look for our construction information system:

Blueprint Search Interface

<form action="/search/blueprints" method="GET">
  <h2>Blueprint Search</h2>

  <div class="form-group">
    <label for="description">Description:</label>
    <input
      type="text"
      id="description"
      name="description"
      placeholder="e.g., residential building, hospital, school"
    />
  </div>

  <div class="form-group">
    <label for="start-date">Start Date:</label>
    <input type="date" id="start-date" name="start_date" />
  </div>

  <div class="form-group">
    <label for="end-date">End Date:</label>
    <input type="date" id="end-date" name="end_date" />
  </div>

  <button type="submit">Search Blueprints</button>
</form>

Document Search Interface

<form action="/search/documents" method="GET">
  <h2>Document Search</h2>

  <div class="form-group">
    <label for="query">Search Terms:</label>
    <input
      type="text"
      id="query"
      name="query"
      placeholder="e.g., Johnson project, HVAC specifications"
    />
  </div>

  <div class="form-group">
    <label for="document-type">Document Type:</label>
    <select id="document-type" name="document_type">
      <option value="">All Documents</option>
      <option value="contract">Contracts</option>
      <option value="proposal">Proposals</option>
      <option value="bid">Bids</option>
    </select>
  </div>

  <button type="submit">Search Documents</button>
</form>

These interfaces directly map to the tool interfaces we defined earlier, providing users with a clear, structured way to access the same capabilities available to the language model.

The key insight is that RAG isn't just about adding chat to your product—it's about building a comprehensive information discovery system where chat is just one interface option among many specialized tools that help users access information efficiently.

Beyond Simple Forms

These specialized interfaces don't have to be simple forms. They can include rich visualizations, interactive elements, and specialized displays for different content types. For example, a blueprint search might display results on a timeline or a map, while a document search might offer faceted filters and previews. The key is that they map directly to your underlying retrieval tools.

User Feedback as Training Data

A particularly valuable aspect of direct tool access is that user interactions can provide high-quality training data for improving both retrieval and routing:

  1. When users select a specific tool, that's a signal about their intent
  2. When users click on search results, that's a signal about relevance
  3. When users refine their search, that's a signal about what was missing
  4. When users explicitly rate or save results, that's direct feedback on quality

User Feedback Collection Mechanisms

To maximize the value of user feedback, consider implementing:

  • Tool Selection Tracking: Record which specialized tool a user chooses for each query
  • Click Tracking: Monitor which search results users engage with
  • Query Refinement Analysis: Capture how users modify queries that didn't yield useful results
  • Explicit Feedback Buttons: Add "Was this helpful?" buttons to results
  • Result Saving: Allow users to save or bookmark useful results
  • Session Analysis: Examine session patterns to identify successful vs. unsuccessful paths

These interactions can be logged and used to:

  • Fine-tune embedding models with user-confirmed relevant documents
  • Improve router accuracy by learning from user tool selections
  • Create better few-shot examples based on successful interactions
  • Prioritize development efforts based on usage patterns
  • Identify gaps in your retrieval capabilities

Implementing a Feedback Loop

Here's how you might implement a feedback collection and utilization system:

from datetime import datetime

# `feedback_collection` is assumed to be a MongoDB-style collection
# (e.g., from pymongo); swap in whatever datastore you use.

def record_user_feedback(user_id, query, selected_tool, results, clicked_result_ids, explicit_rating=None):
    """
    Record user feedback for future training data collection.

    Args:
        user_id: Identifier for the user
        query: The user's original query
        selected_tool: Which tool they used (or 'chat' if they used the chat interface)
        results: The results returned to the user
        clicked_result_ids: Which result IDs the user clicked on
        explicit_rating: Optional explicit rating (1-5) provided by the user
    """
    feedback_entry = {
        "user_id": user_id,
        "timestamp": datetime.now().isoformat(),
        "query": query,
        "selected_tool": selected_tool,
        "results": results,
        "clicked_result_ids": clicked_result_ids,
        "explicit_rating": explicit_rating,
    }

    # Store feedback in database
    feedback_collection.insert_one(feedback_entry)

    # If this was a highly-rated interaction, consider adding it to examples
    if explicit_rating and explicit_rating >= 4:
        consider_adding_to_examples(feedback_entry)

def generate_training_data_from_feedback(min_clicks=1, min_rating=None, date_range=None):
    """
    Generate training data from collected user feedback.

    Args:
        min_clicks: Minimum number of clicked results a feedback entry must contain
        min_rating: Minimum explicit rating (if available)
        date_range: Optional date range to filter feedback

    Returns:
        Dictionary with router_training_data and retrieval_training_data
    """
    # Query conditions
    conditions = {}
    if min_rating:
        conditions["explicit_rating"] = {"$gte": min_rating}
    if date_range:
        conditions["timestamp"] = {"$gte": date_range[0], "$lte": date_range[1]}

    # Retrieve feedback entries
    feedback_entries = feedback_collection.find(conditions)

    router_examples = []
    retrieval_examples = []

    for entry in feedback_entries:
        # Generate router training examples
        if entry["selected_tool"] != "chat":
            router_examples.append({
                "query": entry["query"],
                "tool": entry["selected_tool"]
            })

        # Generate retrieval training examples (skip entries with too few clicks)
        if len(entry["clicked_result_ids"]) >= min_clicks:
            for result_id in entry["clicked_result_ids"]:
                retrieval_examples.append({
                    "query": entry["query"],
                    "relevant_doc_id": result_id
                })

    return {
        "router_training_data": router_examples,
        "retrieval_training_data": retrieval_examples
    }

def update_few_shot_examples(router_examples, max_examples_per_tool=5):
    """
    Update the few-shot examples used in the router based on user feedback.

    Args:
        router_examples: Router examples generated from feedback
        max_examples_per_tool: Maximum number of examples to keep per tool
    """
    # Group examples by tool
    examples_by_tool = {}
    for example in router_examples:
        tool = example["tool"]
        if tool not in examples_by_tool:
            examples_by_tool[tool] = []
        examples_by_tool[tool].append(example)

    # Select the best examples for each tool
    selected_examples = []
    for tool, examples in examples_by_tool.items():
        # Sort by frequency or other quality metric
        sorted_examples = sort_examples_by_quality(examples)
        selected_examples.extend(sorted_examples[:max_examples_per_tool])

    # Update the router's few-shot examples
    update_router_prompt(selected_examples)
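
A periodic job can tie these functions together; a minimal sketch, assuming a 30-day window and the rating threshold used above:

def run_feedback_improvement_cycle():
    """Scheduled job: turn recent feedback into fresh router examples."""
    from datetime import datetime, timedelta

    now = datetime.now()
    date_range = ((now - timedelta(days=30)).isoformat(), now.isoformat())

    training_data = generate_training_data_from_feedback(
        min_clicks=1, min_rating=4, date_range=date_range
    )
    update_few_shot_examples(training_data["router_training_data"])
    # retrieval_training_data would instead feed an embedding fine-tuning pipeline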

This creates another improvement flywheel: as users interact with the system, it collects data that makes both retrieval and routing better, which leads to higher user satisfaction and more interactions.

Feedback Biases

Be aware of potential biases in user feedback:

  1. Position bias: Users tend to click on top results regardless of relevance
  2. Interface bias: Different interfaces encourage different interaction patterns
  3. User expertise bias: Expert users interact differently than novices
  4. Success bias: Successful interactions generate more feedback than failures

To mitigate these biases:

  • Occasionally randomize result ordering for evaluation (sketched below)
  • Analyze feedback separately across user expertise levels
  • Specifically seek feedback on unsuccessful interactions
  • Complement implicit feedback with explicit ratings
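
The first mitigation can be as simple as an exploration flag on a small fraction of sessions; a minimal sketch, where the 5% exploration rate is an assumption to tune:

import random

def maybe_shuffle_results(results, exploration_rate=0.05):
    """
    Occasionally serve results in random order so click data is not purely
    position-biased. Tag these sessions so they can be analyzed separately
    (and excluded from ranking-quality metrics if desired).
    """
    if random.random() < exploration_rate:
        shuffled = list(results)
        random.shuffle(shuffled)
        return shuffled, True   # True marks an exploration session
    return list(results), False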

Success Formula

System success depends on two factors:

P(success) = P(find right document | right tool) × P(right tool)

This means:

  1. Each retriever must work well when selected
  2. The router must select the right retriever

Diagnostic Framework

This formula helps diagnose problems:

  • Low tool selection recall → improve routing
  • Low retrieval recall → improve specific retriever

Example: Imagine users report that when asking about blueprints, they only get satisfactory answers 40% of the time. There are two very different scenarios that could cause this:

Scenario 1: The router correctly selects the blueprint search tool 95% of the time, but the blueprint search itself only finds the right blueprints 42% of the time.

  • P(right tool) = 0.95
  • P(find right document | right tool) = 0.42
  • P(success) = 0.95 × 0.42 = 0.40 (40%)

Scenario 2: The blueprint search is excellent at finding the right blueprints 80% of the time when used, but the router only selects it 50% of the time (often choosing document search instead).

  • P(right tool) = 0.50
  • P(find right document | right tool) = 0.80
  • P(success) = 0.50 × 0.80 = 0.40 (40%)

Same 40% success rate, but completely different problems requiring different solution strategies:

For Scenario 1 (retrieval problem):

  • Generate synthetic data to improve the blueprint search capability
  • Fine-tune embedding models specifically for blueprint content
  • Improve the extraction and structuring of blueprint metadata
  • Experiment with different chunking strategies for blueprints

For Scenario 2 (routing problem):

  • Add more few-shot examples showing when to use the blueprint tool
  • Improve the blueprint tool description to make it more distinctive
  • Add user feedback from successful interactions into your examples
  • Consider UI changes to help users explicitly request blueprints

Independent Measurement

Measure separately:

  • Per-tool recall: Retriever success rate when used
  • Tool selection accuracy: Router success rate

A dashboard with both metrics shows where to focus.
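
Given a log of routed queries, the decomposition can be computed directly. A minimal sketch, assuming each log entry records whether the right tool was chosen and whether the right document was found:

def diagnose_success_rate(logs):
    """
    Decompose end-to-end success into routing and retrieval components.
    Assumes each entry has 'right_tool' (bool) and 'found_document'
    (bool, meaningful only when the right tool was used).
    """
    right_tool = [entry for entry in logs if entry["right_tool"]]
    p_right_tool = len(right_tool) / len(logs) if logs else 0.0
    p_find_given_tool = (
        sum(entry["found_document"] for entry in right_tool) / len(right_tool)
        if right_tool else 0.0
    )
    return {
        "p_right_tool": p_right_tool,
        "p_find_given_right_tool": p_find_given_tool,
        "p_success": p_right_tool * p_find_given_tool,
    }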

From Metrics to Roadmap

This formula provides a clear framework for planning both product and research efforts:

| P(success \| right tool) | P(right tool \| query) | Strategy |
|---|---|---|
| High | High | These are strengths to highlight in your product |
| Low | High | Research focus needed on specific retrievers |
| High | Low | Focus on improving the router or exposing tools directly |
| Low | Low | Consider whether this query type is worth supporting |
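
These thresholds can drive an automated triage report. A sketch, where the 0.7 cutoff is an arbitrary assumption to tune per product:

def roadmap_quadrant(p_success_given_tool, p_right_tool, threshold=0.7):
    """Map the two probabilities onto the strategy table above."""
    retrieval_ok = p_success_given_tool >= threshold
    routing_ok = p_right_tool >= threshold
    if retrieval_ok and routing_ok:
        return "Strength to highlight in your product"
    if routing_ok:
        return "Research focus needed on this retriever"
    if retrieval_ok:
        return "Improve the router or expose the tool directly"
    return "Consider whether this query type is worth supporting"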

Systematic measurement and improvement of both components creates a continuous improvement cycle.

Summary

This book covered systematic RAG improvement:

  1. Synthetic data for evaluation
  2. Converting evaluations to training data
  3. Feedback collection through UX
  4. User segmentation and analysis
  5. Specialized retrieval capabilities
  6. Unified architecture with routing

The result: a system that retrieves the right information using the right specialized capabilities.

Core principle: Synthetic data and customer feedback are the fundamental building blocks. Everything else is implementation details that will evolve.

The Improvement Process

  1. Measure performance by component
  2. Identify limiting factors
  3. Generate synthetic test data
  4. Implement targeted improvements
  5. Collect user feedback
  6. Repeat continuously

This process works for first-time builders and experienced teams alike. Tools change; the process remains.