Chapter 0: Introduction - The Product Mindset for RAG
Chapter at a Glance
Prerequisites: Basic understanding of LLMs and familiarity with building software applications
What You Will Learn:
- Foundational RAG concepts (embeddings, vector databases, chunking)
- The product mindset vs implementation mindset
- The improvement flywheel framework
- Common failure patterns and how to avoid them
Case Study Reference: Legal tech company case study demonstrating systematic improvement from 63% to 87% accuracy
Time to Complete: 45-60 minutes
Key Insight
Successful RAG systems are not projects that ship once—they are products that improve continuously. The difference between teams that succeed and those that fail is not the embedding model or vector database they choose. It is whether they treat RAG as a living product that learns from every user interaction, or as a static implementation that slowly decays in production.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the difference between an implementation mindset and a product mindset for RAG systems
- Describe how embeddings, vector databases, and chunking work together in RAG
- Distinguish between semantic and lexical search and know when to use each
- Apply the improvement flywheel framework to your own RAG applications
- Identify whether a problem is an inventory problem or a capability problem
- Recognize common failure patterns that cause RAG systems to underperform
Introduction
After a decade of building AI systems, we have watched the same pattern repeat: teams ship a RAG system, celebrate the launch, then watch it slowly fail in production. User questions evolve. Data distributions shift. Edge cases multiply. Within weeks, the system that worked perfectly in demos struggles with real queries.
This book shows how to avoid that trap. The most successful RAG systems are not the ones with the fanciest embeddings or the biggest context windows—they are the ones that get better every week based on what users actually do with them. They treat deployment as the beginning of improvement, not the end of development.
For Product Managers
This chapter establishes the mental models you need to lead RAG initiatives effectively. Focus on the improvement flywheel, the inventory vs capability distinction, and how to measure success. You can skim the technical sections on embeddings and vector databases, but understanding the business implications of each concept will help you make better decisions.
For Engineers
This chapter provides foundational concepts you will use throughout the book. Pay special attention to the technical sections on embeddings, vector databases, and chunking. These concepts appear in every subsequent chapter, so building strong intuition here will pay dividends later.
Core Content
Foundational Concepts
Before diving into improvement strategies, you need to understand the building blocks of RAG systems. These concepts appear throughout this book, so we introduce them here with both business context and technical depth.
Embeddings and Vector Representations
Embeddings are the foundation of modern RAG systems. They transform text into numerical representations that capture meaning.
For Product Managers
What embeddings mean for your business: Embeddings determine how well your system understands user intent. Poor embeddings mean users cannot find what they need, even when the answer exists in your knowledge base. When evaluating RAG vendors or approaches, ask: "How well do these embeddings capture the vocabulary and concepts specific to our domain?"
**Key decision point**: Off-the-shelf embeddings work well for general content. Domain-specific content (legal, medical, technical) often benefits from fine-tuned embeddings. The ROI of fine-tuning depends on how specialized your vocabulary is.
For Engineers
How embeddings work: An embedding model converts text into a dense vector (typically 384-1536 dimensions). Similar texts produce similar vectors, measured by cosine similarity.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# These will have similar embeddings
text1 = "The contract expires on December 31st"
text2 = "The agreement terminates at year end"
embedding1 = model.encode(text1)
embedding2 = model.encode(text2)
# Cosine similarity will be high (close to 1.0)
```
**Vector space intuition**: Think of embeddings as coordinates in a high-dimensional space. Words and phrases with similar meanings cluster together. "Contract" and "agreement" are close. "Contract" and "banana" are far apart.
**Cosine similarity**: Measures the cosine of the angle between two vectors, ranging from -1 (opposite) to 1 (identical). For normalized embeddings it is equivalent to the dot product. A common heuristic is to treat similarity above roughly 0.7-0.8 as "relevant," but the right threshold varies by embedding model and should be calibrated on your own data.
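To make this concrete, here is a minimal continuation of the embedding example above: it reuses `embedding1` and `embedding2` from the previous snippet and computes cosine similarity directly with NumPy.

```python
import numpy as np

# Cosine similarity = dot product divided by the product of vector lengths.
# Values near 1.0 mean the two texts point in the same "meaning direction".
similarity = np.dot(embedding1, embedding2) / (
    np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)
print(f"Cosine similarity: {similarity:.3f}")
```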
```mermaid
graph LR
    subgraph "Vector Space"
        A["contract expiration"] --> B["agreement termination"]
        A --> C["legal deadline"]
        D["banana recipe"] --> E["fruit smoothie"]
        D --> F["cooking instructions"]
    end
    style A fill:#9f9
    style B fill:#9f9
    style C fill:#9f9
    style D fill:#99f
    style E fill:#99f
    style F fill:#99f
```
Vector Databases
Vector databases store embeddings and enable fast similarity search at scale.
For Product Managers
Why vector databases matter: Traditional databases find exact matches. Vector databases find similar matches. This is what enables semantic search—finding documents that mean the same thing as the query, even if they use different words.
**Cost considerations**: Vector databases charge based on storage (number of vectors) and queries per second. For most applications, costs are modest ($50-500/month). Costs increase significantly at scale (millions of documents) or with high query volume.
**Key vendors**: Pinecone (managed, easy to start), Weaviate (open source, flexible), pgvector (PostgreSQL extension, good for existing Postgres users), LanceDB (open source, hybrid search support), Chroma (simple Python API, great for prototyping), Turbopuffer (high-performance, fast at scale).
For Engineers
How vector databases work: They use Approximate Nearest Neighbor (ANN) algorithms to find similar vectors without comparing against every stored vector. Common algorithms include HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index).
**Tradeoffs**:
| Factor | Exact Search | ANN Search |
|--------|--------------|------------|
| Speed | O(n) - slow at scale | O(log n) - fast |
| Accuracy | 100% | 95-99% typical |
| Memory | Low | Higher (index structures) |
**When to use what**:
- Small datasets (<10K documents): Exact search is fine
- Medium datasets (10K-1M): ANN with high recall settings
- Large datasets (>1M): ANN with tuned parameters
```python
# Example: LanceDB vector database
import lancedb
from lancedb.pydantic import LanceModel, Vector
class Document(LanceModel):
    id: str
    text: str
    vector: Vector(384)  # Matching embedding dimension
db = lancedb.connect(":memory:")
table = db.create_table("documents", schema=Document)
```
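Continuing that sketch, inserting and querying documents might look like the following. This assumes the `model` encoder from the embeddings example above; check the exact LanceDB calls against the version you are running.

```python
# Embed and insert a few documents into the table created above.
docs = [
    {"id": "1", "text": "The contract expires on December 31st"},
    {"id": "2", "text": "Banana smoothie recipe with frozen fruit"},
]
table.add([{**d, "vector": model.encode(d["text"]).tolist()} for d in docs])

# Query: embed the question, then ask the index for the nearest vectors.
query_vector = model.encode("When does the agreement end?").tolist()
for row in table.search(query_vector).limit(2).to_list():
    print(row["id"], row["text"])
```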
Semantic vs Lexical Search
Understanding when to use semantic search versus lexical search is critical for RAG performance.
For Product Managers
When semantic search wins: Users describe what they want in their own words. "How do I cancel my subscription?" should match documentation titled "Account Termination Process."
**When lexical search wins**: Users search for specific terms, product codes, or exact phrases. "Error code E-4521" should match documents containing that exact string.
**Business implication**: Most production systems need both. Hybrid search combines semantic and lexical approaches. The investment in hybrid search typically pays off when you have both natural language queries and exact-match requirements.
For Engineers
Lexical search (BM25): Ranks documents by term frequency and inverse document frequency. Works well for exact matches, rare terms, and when users know the vocabulary.
```python
# BM25 excels at exact matches
query = "42 U.S.C. § 1983" # Legal citation
# Lexical search finds exact match
# Semantic search might miss it (citation is not "meaningful")
```
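For a runnable sense of lexical scoring, here is a minimal sketch using the open-source `rank_bm25` package (an assumption; any BM25 implementation behaves the same way). The document containing the exact citation scores highest.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Plaintiff brought suit under 42 U.S.C. § 1983 for civil rights violations",
    "The agreement terminates at year end",
]
# BM25 operates on tokens; naive whitespace splitting is enough for a sketch.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "42 U.S.C. § 1983".lower().split()
print(bm25.get_scores(query_tokens))  # the citation document scores far higher
```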
**Semantic search (embeddings)**: Ranks by meaning similarity. Works well for natural language queries, synonyms, and when users do not know exact terminology.
```python
# Semantic search excels at meaning
query = "employee fired unfairly"
# Matches documents about "wrongful termination"
# Lexical search would miss this (different words)
```
**Hybrid approach**: Retrieve with both methods, then combine results using Reciprocal Rank Fusion (RRF) or learned weights.
```python
def hybrid_search(query, alpha=0.5):
    semantic_results = semantic_search(query)
    lexical_results = lexical_search(query)
    return reciprocal_rank_fusion(semantic_results, lexical_results, alpha)
```
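`reciprocal_rank_fusion` is left undefined above. A minimal sketch, assuming each result list is an ordered list of document IDs and using the conventional smoothing constant k=60, with `alpha` weighting the semantic list against the lexical one:

```python
def reciprocal_rank_fusion(
    semantic_results: list[str],
    lexical_results: list[str],
    alpha: float = 0.5,
    k: int = 60,
) -> list[str]:
    """Fuse two ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = {}
    # Each document earns 1 / (k + rank) from every list it appears in;
    # alpha shifts the weight between the semantic and lexical contributions.
    for rank, doc_id in enumerate(semantic_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(lexical_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```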
Chunking Strategies
Chunking determines how documents are split for retrieval. Poor chunking is one of the most common causes of RAG failures.
For Product Managers
Why chunking matters: Chunks that are too small lose context. Chunks that are too large dilute relevance. The right chunking strategy depends on your content type and query patterns.
**Size considerations**:
- Small chunks (200-500 tokens): Good for precise answers, bad for context
- Medium chunks (500-1000 tokens): Balanced approach for most use cases
- Large chunks (1000-2000 tokens): Good for complex topics, may include irrelevant content
**Key question to ask**: "When users ask questions, how much context do they need to get a useful answer?"
For Engineers
Common chunking strategies:
| Strategy | Best For | Tradeoffs |
|----------|----------|-----------|
| Fixed-size | Simple documents, consistent format | May split mid-sentence |
| Sentence-based | Conversational content | Chunks vary in size |
| Semantic | Complex documents | Computationally expensive |
| Page-level | PDFs, structured documents | Large chunks, may miss details |
| Hierarchical | Long documents, books | Complex to implement |
**Implementation example**:
```python
def recursive_character_split(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    separators: list[str] = ["\n\n", "\n", ". ", " ", ""]
) -> list[str]:
    """
    Split text recursively by separators, respecting chunk size and overlap.

    Args:
        text: Text to split
        chunk_size: Maximum size of each chunk
        chunk_overlap: Number of characters to overlap between chunks
        separators: List of separators to try, in order of preference

    Returns:
        List of text chunks
    """
    if len(text) <= chunk_size:
        return [text]

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            chunks.append(text[start:])
            break

        # Try to find a good split point using separators
        split_pos = end
        for separator in separators:
            if separator:
                # Look backwards from end for separator
                pos = text.rfind(separator, start, end)
                if pos != -1:
                    split_pos = pos + len(separator)
                    break

        chunk = text[start:split_pos]
        chunks.append(chunk)

        # Move start forward with overlap, always making forward progress
        # (a split point near the start could otherwise cause an infinite loop)
        next_start = split_pos - chunk_overlap
        start = next_start if next_start > start else split_pos

    return chunks

# Usage: Split document with semantic boundaries and overlap
chunks = recursive_character_split(
    document,
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
```
**Critical insight from production**: Many tutorials recommend tiny chunks (200 characters) based on outdated advice for models with limited context windows. Modern models handle larger chunks well. In one e-commerce implementation, small chunks meant no single chunk contained complete product information, causing a 13% hallucination rate.
The Alignment Problem
The alignment problem is one of the most overlooked causes of RAG failures.
For Product Managers
What alignment means: Your embeddings must align with how users search. If you embed product descriptions but users search by purchase patterns, retrieval will fail even with perfect embeddings.
**Business impact**: Misalignment causes the frustrating situation where "the answer is in there somewhere" but the system cannot find it. Users lose trust quickly.
**How to detect**: Compare what you embed (document content) with what users search for (query patterns). If they use different vocabulary or concepts, you have an alignment problem.
For Engineers
Technical explanation: Embedding models are trained on specific tasks. Most are trained for semantic similarity between similar texts. But RAG requires matching questions to answers—a different task.
**Example of misalignment**:
```python
# Document: "The Model X features autopilot and 300-mile range"
# Query: "Which Tesla should I buy for long road trips?"
# The query is about use case (road trips)
# The document is about features (range)
# Standard embeddings may not connect these well
```
**Solutions**:
1. Query expansion: Transform queries to look more like documents
2. Hypothetical document embeddings (HyDE): Generate a hypothetical answer and embed that instead of the query (sketched below)
3. Fine-tuning: Train embeddings on your specific query-document pairs
4. Contextual retrieval: Add context to chunks at indexing time
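As a sketch of option 2 (HyDE), assuming an `llm` client with a `generate(prompt)` method plus the `model` encoder and LanceDB `table` from the earlier examples:

```python
def hyde_search(question: str, top_k: int = 5):
    # Step 1: ask the LLM to draft a plausible (possibly imperfect) answer.
    hypothetical_answer = llm.generate(
        f"Write a short passage that answers this question:\n{question}"
    )
    # Step 2: embed the hypothetical answer; it reads like a document,
    # so it lands closer to real documents in the vector space.
    vector = model.encode(hypothetical_answer).tolist()
    # Step 3: retrieve with that embedding instead of the raw question.
    return table.search(vector).limit(top_k).to_list()
```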
Inventory vs Capability Problem
Before optimizing your RAG system, determine whether you have an inventory problem or a capability problem.
For Product Managers
Strategic distinction: These problems require completely different solutions. Investing in better retrieval when you lack content wastes resources. Adding content when retrieval is broken wastes resources.
**Inventory problem**: The answer does not exist in your knowledge base
- Missing documents entirely
- Outdated information
- Gaps in content coverage
- **Solution**: Add or update content
**Capability problem**: The answer exists but the system cannot find it
- Poor retrieval failing to match query to document
- Wrong search strategy for the query type
- Inability to understand query intent
- **Solution**: Improve retrieval, understanding, or routing
For Engineers
How to diagnose:
```python
def diagnose_failure(query, expected_document):
    # Step 1: Does the document exist?
    if not document_exists(expected_document):
        return "INVENTORY_PROBLEM"
    # Step 2: Can a human find it?
    if human_can_find(query, expected_document):
        return "CAPABILITY_PROBLEM"
    # Step 3: Is the content sufficient?
    if not content_answers_query(expected_document, query):
        return "INVENTORY_PROBLEM"  # Content exists but does not answer
    return "CAPABILITY_PROBLEM"
```
**Real example**: A team spent months improving their embedding model when the actual problem was that 21% of their documents were silently dropped during ingestion due to encoding issues. Always verify your inventory before optimizing capabilities.
The Product Mindset
When organizations implement RAG systems, they often approach the work as a purely technical challenge. They focus on selecting the right embedding model, vector database, and LLM, then consider the project "complete" once these components are integrated and deployed.
This approach inevitably leads to disappointment.
For Product Managers
Why product thinking matters: RAG systems serve users, not benchmarks. The teams that succeed are the ones that measure user outcomes, not just technical metrics. Your role is to ensure the team stays focused on user value, not technical elegance.
**ROI of systematic improvement**: Teams with a systematic approach ship improvements weekly. Teams without one spend months debating what might work. The compound effect is dramatic—after six months, systematic teams are 10x better than ad-hoc teams.
For Engineers
How to apply product thinking: Every technical decision should connect to user outcomes. When evaluating a new embedding model, do not just look at benchmark scores—test it on your actual queries and measure user-facing metrics.
**The mental shift**:
| Old Question | New Question |
|--------------|--------------|
| Which embedding model has the best benchmark? | Which embedding helps users find answers fastest? |
| What is the optimal chunk size? | How do we know if chunking helps or hurts users? |
| How do we eliminate hallucinations? | How do we build trust even when imperfect? |
| Should we use GPT-4 or Claude? | Which model capabilities matter for our use case? |
Here is how to identify which mindset a team has:
Implementation Mindset:
- "We need to implement RAG"
- Obsessing over embedding dimensions and context windows
- Success = it works in the demo
- Big upfront architecture decisions
- Focus on picking the "best" model
Product Mindset:
- "We need to help users find answers faster"
- Tracking answer relevance and task completion
- Success = users keep coming back
- Architecture that can evolve
- Focus on learning from user behavior
The Improvement Flywheel
The improvement flywheel is the framework that transforms RAG from a static implementation into a continuously improving product.
```mermaid
graph TD
    A[Build Basic RAG] --> B[Create Synthetic Evaluation Data]
    B --> C[Define Metrics]
    C --> D[Test Hypotheses]
    D --> E[Deploy & Collect Real User Feedback]
    E --> F[Categorize & Analyze User Questions]
    F --> G[Make Targeted Improvements]
    G --> H[Implement Monitoring]
    H --> B
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
    style G fill:#dfd,stroke:#333,stroke-width:2px
```
For Product Managers
Business value of the flywheel: Each rotation makes the next one faster. More data leads to better insights, which lead to smarter improvements, which generate more engaged users who provide better data. After 3-4 rotations, teams report 50% reduction in time-to-improvement.
**Key metrics to track at each stage**:
| Stage | Leading Metric | Lagging Metric |
|-------|----------------|----------------|
| Cold Start | Evaluation coverage | N/A |
| Deployment | Feedback rate | User satisfaction |
| Growth | Improvement velocity | Retention |
| Optimization | Cost per query | Revenue impact |
For Engineers
Technical implementation of the flywheel:
**Stage 1 - Synthetic Data**: Generate questions from your content to bootstrap evaluation.
```python
def generate_synthetic_question(chunk: str) -> str:
    prompt = f"""Generate a question that would be answered by this text:
{chunk}
Question:"""
    return llm.generate(prompt)
```
**Stage 2 - Metrics**: Implement precision, recall, and MRR tracking.
**Stage 3 - Feedback**: Instrument your application for data collection.
**Stage 4 - Analysis**: Use clustering to identify query patterns.
**Stage 5 - Improvement**: Build specialized retrievers for high-impact segments.
**Stage 6 - Monitoring**: Track performance continuously, alert on regressions.
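For Stage 6, a minimal regression check might compare the latest evaluation run against a stored baseline; the metric names and tolerance below are illustrative assumptions.

```python
import logging

def check_for_regression(current: dict, baseline: dict, tolerance: float = 0.05):
    """Warn when any tracked metric drops more than `tolerance` below its baseline."""
    for metric, baseline_value in baseline.items():
        current_value = current.get(metric, 0.0)
        if current_value < baseline_value - tolerance:
            logging.warning(
                "Regression in %s: %.3f (baseline %.3f)",
                metric, current_value, baseline_value,
            )

# Example: compare this week's retrieval metrics against the stored baseline.
check_for_regression(
    current={"recall_at_k": 0.81, "mrr": 0.64},
    baseline={"recall_at_k": 0.88, "mrr": 0.70},
)
```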
The flywheel solves real problems at each stage:
| Phase | Business Challenge | Technical Challenge | Flywheel Solution |
|---|---|---|---|
| Cold Start | No data to guide design decisions | No examples to train or evaluate against | Generate synthetic questions from content, establish baseline metrics |
| Initial Deployment | Understanding what users actually need | Learning what causes poor performance | Instrument application for data collection, implement feedback mechanisms |
| Growth | Prioritizing improvements with limited resources | Addressing diverse query types effectively | Use topic modeling to segment questions, identify highest-impact opportunities |
| Optimization | Maintaining quality as usage scales | Combining multiple specialized components | Create unified routing architecture, implement monitoring and alerts |
Common Failure Patterns
Understanding why RAG systems fail helps you avoid the same mistakes.
PM Pitfall
Treating RAG as a project, not a product: The most common strategic mistake is declaring victory after launch. RAG systems decay without continuous improvement. Plan for ongoing investment from the start.
**Optimizing the wrong metric**: Teams often optimize for retrieval metrics (precision, recall) while ignoring user outcomes (task completion, satisfaction). A system with 95% precision but 10% task completion is failing.
**Ignoring the cold start**: Without evaluation data, you cannot measure improvement. Teams that skip synthetic data generation spend months guessing what might work.
Engineering Pitfall
Silent data loss: In one medical chatbot project, 21% of documents were silently dropped due to encoding issues. The team spent months debugging retrieval when the problem was missing data. Always monitor document counts at each pipeline stage.
**Chunking too small**: Many implementations use tiny chunks (200 characters) because they follow outdated tutorials. This dilutes context and causes hallucinations. Test chunk sizes on your actual queries.
**Naive embedding usage**: Most embeddings are trained for semantic similarity between similar texts, not for matching questions to answers. Consider query expansion, HyDE, or fine-tuning.
**Index staleness**: In a financial news system, the index had not been refreshed for two weeks, causing outdated earnings reports to be returned. Monitor index freshness for time-sensitive applications.
**Accepting vague queries**: Queries like "health tips" force your system to retrieve broadly. Implement query classification to detect and handle low-information queries.
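One lightweight way to catch such queries before retrieval is a simple heuristic classifier; the thresholds and term list below are illustrative assumptions, and an LLM-based classifier is a natural upgrade.

```python
GENERIC_TERMS = {"tips", "help", "info", "information", "advice", "best", "practices"}

def is_low_information(query: str, min_tokens: int = 4) -> bool:
    """Flag queries that are too short or too generic to retrieve against."""
    tokens = query.lower().split()
    if len(tokens) < min_tokens:
        return True
    # A query made up almost entirely of generic terms carries little signal.
    specific = [t for t in tokens if t not in GENERIC_TERMS]
    return len(specific) <= 1

print(is_low_information("health tips"))                                    # True: ask a clarifying question
print(is_low_information("how much vitamin D should an adult take daily"))  # False: retrieve normally
```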
How to Use This Book
This book is designed for two audiences: Product Managers and Engineers. Each chapter includes content for both, clearly marked with admonitions.
Reading Paths
For Product Managers:
- Read the "Key Insight" and "Learning Objectives" for each chapter
- Focus on sections marked "For Product Managers"
- Skim technical sections for context, but do not worry about implementation details
- Pay close attention to "Common Pitfalls" and "Action Items"
For Engineers:
- Read chapters in order—concepts build on each other
- Focus on sections marked "For Engineers"
- Work through code examples in your own environment
- Use the appendices for mathematical deep dives
For Full Understanding:
- Read everything in order
- Work through code examples
- Complete reflection questions
- Apply concepts to your own RAG system
Admonition Types
Throughout this book, you will see these callout boxes:
For Product Managers
Strategic insights, business implications, and decision frameworks for product leaders.
For Engineers
Technical details, implementation guidance, and code examples for developers.
PM Pitfall
Strategic mistakes that product teams commonly make.
Engineering Pitfall
Technical mistakes that engineering teams commonly make.
Info
General information relevant to all readers.
Example
Concrete examples illustrating concepts.
Success Story
Success stories and positive outcomes.
Navigation Guide
Each chapter follows the same structure:
- Chapter at a Glance: Prerequisites, outcomes, time estimate
- Key Insight: One-paragraph summary
- Learning Objectives: What you will be able to do
- Introduction: Context and motivation
- Core Content: Main material with PM and Engineer sections
- Case Study Deep Dive: Real-world application
- Implementation Guide: Step-by-step instructions
- Common Pitfalls: Mistakes to avoid
- Related Content: Links to talks, transcripts, office hours
- Action Items: Next steps for PM and Engineering teams
- Reflection Questions: Self-assessment
- Summary: Key takeaways
- Further Reading: Academic papers and tools
- Related Chapters: Links to related content
- Navigation: Links to previous and next chapters
Case Study Deep Dive
Legal Tech Company: From 63% to 87% Accuracy
A legal tech company building case law search provides a concrete example of the improvement flywheel in action.
For Product Managers
Business outcomes:
- Research time reduced by 40%
- Lawyers began using the system daily (vs. occasional use before)
- Clear roadmap for continued improvement
- Engineering time allocated based on data, not opinions
**Key decisions that drove success**:
1. Invested in synthetic data generation before launch
2. Implemented feedback collection from day one
3. Used query clustering to identify high-impact segments
4. Built specialized retrievers for distinct query types
For Engineers
Technical implementation timeline:
**Month 1 - Baseline**: Basic RAG with standard embeddings. Generated 200 test queries from case law. Baseline accuracy: 63%.
**Month 2 - First Iteration**: Testing revealed legal citations like "42 U.S.C. § 1983" were being split across chunks. Fixed chunking to respect citation patterns. Accuracy: 72%.
**Month 3 - Deployment**: Shipped with thumbs up/down feedback. Tracked which answers lawyers copied into briefs.
**Months 4-5 - Pattern Discovery**: After 5,000 queries, three distinct patterns emerged:
| Query Type | Volume | Accuracy | Status |
|------------|--------|----------|--------|
| Case citations | 40% | 91% | Working well |
| Legal definitions | 35% | 78% | Acceptable |
| Procedural questions | 25% | 34% | Failing |
**Month 6 - Specialized Solutions**: Built dedicated retrieval for each type:
- Case citations: Exact matching on citation format
- Definitions: Specialized glossary index
- Procedural questions: Separate index from court rules
Overall accuracy: 87%.
**Ongoing**: Monitoring revealed procedural questions growing 3x faster than other types, directing engineering focus for the next quarter.
Implementation Guide
Quick Start for PMs
Week 1: Establish Baseline
- Define what "success" means for your RAG system (task completion? time saved?)
- Identify 3-5 key user journeys to measure
- Work with engineering to generate synthetic evaluation data
- Establish baseline metrics
Week 2: Instrument for Learning
- Add feedback collection (thumbs up/down minimum)
- Set up query logging
- Create a dashboard for key metrics
- Schedule weekly metric reviews
Week 3: Begin Improvement Cycle
- Review first week of feedback data
- Identify top 3 failure patterns
- Prioritize based on user impact and engineering effort
- Start first improvement sprint
Ongoing: Maintain the Flywheel
- Weekly: Review metrics, identify issues
- Monthly: Analyze query clusters, update priorities
- Quarterly: Assess overall progress, adjust strategy
Detailed Implementation for Engineers
Step 1: Generate Synthetic Evaluation Data
```python
import asyncio
from typing import List, Tuple

async def generate_eval_dataset(
    chunks: List[str],
    llm_client,
    num_questions_per_chunk: int = 2
) -> List[Tuple[str, str, str]]:
    """Generate (question, chunk_id, expected_answer) tuples."""

    async def generate_for_chunk(chunk: str, chunk_id: str):
        prompt = f"""Based on this text, generate {num_questions_per_chunk} questions
that could be answered using this information.
Text: {chunk}
Format each question on a new line."""
        response = await llm_client.generate(prompt)
        questions = response.strip().split('\n')
        return [(q, chunk_id, chunk) for q in questions if q.strip()]

    tasks = [
        generate_for_chunk(chunk, f"chunk_{i}")
        for i, chunk in enumerate(chunks)
    ]
    results = await asyncio.gather(*tasks)
    return [item for sublist in results for item in sublist]
```
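Running it might look like this, where `chunks` comes from your chunking step and `llm_client` is a stand-in for any async LLM wrapper with a `generate(prompt)` coroutine (both are assumptions):

```python
eval_dataset = asyncio.run(
    generate_eval_dataset(chunks, llm_client, num_questions_per_chunk=2)
)
print(f"Generated {len(eval_dataset)} (question, chunk_id, answer) examples")
```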
Step 2: Implement Evaluation Metrics
```python
from typing import List, Dict
import numpy as np

def calculate_retrieval_metrics(
    queries: List[str],
    expected_chunks: List[str],
    retrieved_chunks: List[List[str]],
    k: int = 5
) -> Dict[str, float]:
    """Calculate precision, recall, and MRR at k."""
    precisions = []
    recalls = []
    reciprocal_ranks = []
    for expected, retrieved in zip(expected_chunks, retrieved_chunks):
        retrieved_k = retrieved[:k]
        # Precision: relevant retrieved / total retrieved
        relevant_retrieved = 1 if expected in retrieved_k else 0
        precisions.append(relevant_retrieved / len(retrieved_k))
        # Recall: relevant retrieved / total relevant (1 in this case)
        recalls.append(relevant_retrieved)
        # MRR: 1 / rank of first relevant result
        if expected in retrieved_k:
            rank = retrieved_k.index(expected) + 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)
    return {
        "precision_at_k": np.mean(precisions),
        "recall_at_k": np.mean(recalls),
        "mrr": np.mean(reciprocal_ranks)
    }
```
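A quick usage sketch with toy data, assuming your retriever returns a ranked list of chunk IDs per query:

```python
metrics = calculate_retrieval_metrics(
    queries=["When does the contract expire?"],
    expected_chunks=["chunk_12"],
    retrieved_chunks=[["chunk_7", "chunk_12", "chunk_3"]],
    k=5,
)
print(metrics)  # {'precision_at_k': 0.333..., 'recall_at_k': 1.0, 'mrr': 0.5}
```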
Step 3: Set Up Feedback Collection
```python
from datetime import datetime
from enum import Enum
from typing import List

from pydantic import BaseModel, Field

class FeedbackType(str, Enum):
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"

class QueryFeedback(BaseModel):
    query_id: str
    query_text: str
    response_text: str
    retrieved_chunks: List[str]
    feedback_type: FeedbackType
    feedback_text: str | None = None
    timestamp: datetime = Field(default_factory=datetime.now)
    user_id: str | None = None

async def log_feedback(feedback: QueryFeedback, db_client):
    """Log feedback for analysis."""
    await db_client.insert("feedback", feedback.model_dump())
```
Step 4: Implement Query Clustering
```python
from typing import Dict, List

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_queries(
    queries: List[str],
    n_clusters: int = 10
) -> Dict[int, List[str]]:
    """Cluster queries to identify patterns."""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(queries)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    clusters = {}
    for query, label in zip(queries, labels):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(query)
    return clusters
```
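Printing the size and a sample query for each cluster (here `logged_queries` is a hypothetical list of strings pulled from your query log) makes the high-volume segments easy to spot:

```python
clusters = cluster_queries(logged_queries, n_clusters=10)
for label, cluster in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"Cluster {label}: {len(cluster)} queries, e.g. {cluster[0]!r}")
```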
Common Pitfalls
PM Pitfalls
PM Pitfall: Declaring Victory Too Early
The mistake: Celebrating launch and moving the team to other projects.
**Why it happens**: RAG demos well. Stakeholders see impressive results and assume the work is done.
**The consequence**: System performance degrades as user patterns evolve and content becomes stale.
**How to avoid**: Plan for ongoing investment from the start. Budget engineering time for continuous improvement, not just initial development.
PM Pitfall: Optimizing Vanity Metrics
The mistake: Focusing on retrieval metrics (precision, recall) while ignoring user outcomes.
**Why it happens**: Retrieval metrics are easy to measure. User outcomes require more instrumentation.
**The consequence**: A system that looks good on paper but fails users in practice.
**How to avoid**: Always connect technical metrics to user outcomes. If precision improves but task completion does not, something is wrong.
PM Pitfall: Skipping the Cold Start
The mistake: Deploying without evaluation data, planning to "learn from users."
**Why it happens**: Generating synthetic data feels like extra work before launch.
**The consequence**: No baseline to measure against. Months of guessing what might work.
**How to avoid**: Invest in synthetic data generation before launch. Even 100 synthetic queries provide valuable signal.
Engineering Pitfalls
Engineering Pitfall: Silent Data Loss
The mistake: Not monitoring document counts through the ingestion pipeline.
**Why it happens**: Errors are caught and logged, but not aggregated or alerted on.
**The consequence**: Significant portions of your knowledge base are missing without anyone knowing.
**How to avoid**: Track document counts at each pipeline stage. Alert when counts drop unexpectedly. In one case, 21% of documents were silently dropped due to encoding issues.
Engineering Pitfall: Following Outdated Tutorials
The mistake: Using tiny chunks (200 characters) because a tutorial said so.
**Why it happens**: Many tutorials were written for models with limited context windows.
**The consequence**: Chunks too small to contain meaningful information, leading to hallucinations.
**How to avoid**: Test chunk sizes on your actual queries. Modern models handle larger chunks well.
Engineering Pitfall: Ignoring the Alignment Problem
The mistake: Assuming standard embeddings will work for your domain.
**Why it happens**: Embeddings work well on benchmarks, so they should work everywhere.
**The consequence**: Users cannot find answers even when they exist, because queries and documents are embedded differently.
**How to avoid**: Compare query patterns to document content. If they use different vocabulary or concepts, consider query expansion, HyDE, or fine-tuning.
Related Content
Key Insights from the Course
- "The goal is not to build AGI—the goal is to create economically valuable work."
- "Success is defined by doing the most obvious thing over and over again. Your only job is to apply consistent effort."
- "Think about building a recommendation system wrapped around a language model, not just retrieval-augmented generation."
Talk: RAG Antipatterns (Skylar Payne)
Full talk available at docs/talks/rag-antipatterns-skylar-payne.md. Key insights:
- Data quality is foundational: "Look at your data at every step of the process. Start from understanding what your users want, work backwards."
- Silent failures are dangerous: In one implementation, 21% of documents were silently dropped due to encoding issues.
- Evaluate before adding complexity: "About 90% of the time, teams implement complex retrieval paths without evaluating if they actually improve performance."
- The teams that win iterate fastest: "The teams who can make that loop go as fast as possible are the ones who win."
Office Hours
Relevant office hours sessions:
- Cohort 2 Week 1: Discussion of product mindset, cold start problems
- Cohort 3 Week 1: Deep dive on evaluation-first development
Action Items
For Product Teams
- This week: Define success metrics for your RAG system that connect to user outcomes
- This month: Implement feedback collection (minimum: thumbs up/down)
- This quarter: Establish the improvement flywheel with weekly metric reviews
- Ongoing: Maintain a prioritized backlog based on query cluster analysis
For Engineering Teams
- This week: Generate synthetic evaluation data from your content (minimum 100 queries)
- This month: Implement retrieval metrics (precision, recall, MRR) and establish baselines
- This quarter: Set up query logging and clustering to identify patterns
- Ongoing: Monitor document counts, index freshness, and retrieval performance
Reflection Questions
- Is your current RAG system treated as a completed project or an evolving product? What would need to change to shift toward a product mindset?
- Do you have mechanisms in place to learn from user interactions? If not, what is the simplest feedback mechanism you could implement this week?
- How do you currently measure success? Are your metrics leading indicators (actionable) or lagging indicators (outcomes)?
- Think of a recent RAG failure. Was it an inventory problem (missing content) or a capability problem (content exists but was not found)? How would you diagnose this systematically?
- If you had to improve your RAG system by 20% in the next month, where would you focus? What data would you need to make that decision confidently?
Summary
Key Takeaways for Product Managers
- RAG systems are products, not projects. Plan for continuous improvement from day one.
- The improvement flywheel (synthetic data → metrics → feedback → analysis → improvement) is your roadmap.
- Distinguish between inventory problems (missing content) and capability problems (cannot find existing content)—they require different solutions.
- Connect technical metrics to user outcomes. Precision and recall matter only if they improve task completion.
- The teams that iterate fastest win. Build systems that enable rapid experimentation.
Key Takeaways for Engineers
- Foundational concepts (embeddings, vector databases, chunking) appear throughout RAG systems. Build strong intuition here.
- The alignment problem (mismatch between how you embed and how users search) is a common cause of failures.
- Silent data loss is more common than you think. Monitor document counts at every pipeline stage.
- Do not follow outdated tutorials blindly. Test chunk sizes, embedding models, and retrieval strategies on your actual queries.
- Implement evaluation before adding complexity. Most "improvements" make things worse when properly measured.
Further Reading
Academic Papers
- "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020) - The original RAG paper
- "Dense Passage Retrieval for Open-Domain Question Answering" (Karpukhin et al., 2020) - Foundation for modern retrieval
- "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (Thakur et al., 2021) - Standard evaluation benchmark
Tools and Libraries
- Embedding Models: sentence-transformers, OpenAI embeddings, Cohere embeddings
- Vector Databases: Pinecone, Weaviate, pgvector, Chroma, LanceDB
- Evaluation: RAGAS, LangSmith, Braintrust
- Frameworks: LlamaIndex, Haystack
Related Chapters
- Chapter 1: Evaluation-First Development - Deep dive on metrics and synthetic data
- Chapter 2: Training Data and Fine-Tuning - How to improve embeddings for your domain
- Chapter 3: Feedback Systems and UX - Implementing effective feedback collection
Navigation
- Next: Chapter 1: Evaluation-First Development - Learn to measure before you optimize
- Reference: Glossary | Quick Reference
- Book Index: Book Overview