Chapter 3: Feedback Systems and UX
Chapter at a Glance
Prerequisites: Chapter 1 (evaluation framework), Chapter 2 (fine-tuning basics), basic web development
What You Will Learn:
- How to design feedback mechanisms that collect 5x more data
- Streaming techniques that reduce perceived latency by 40%
- Citation patterns that build trust and generate training data
- Chain-of-thought reasoning for 15-20% accuracy improvements
- Validation patterns that catch errors before users see them
Case Study Reference: Zapier (10 to 40 feedback submissions/day), Legal research team (50,000+ labeled examples from citations)
Time to Complete: 60-90 minutes
Key Insight
Good copy beats good UI—changing "How did we do?" to "Did we answer your question?" increases feedback rates by 5x. The difference between 0.1% and 0.5% feedback is not just more data—it is the difference between flying blind and having a clear view of what works. Design every user interaction to potentially generate training data, stream everything to maintain engagement, and add validation layers that catch errors before they reach users.
Learning Objectives
By the end of this chapter, you will be able to:
- Design high-visibility feedback mechanisms that increase collection rates from 0.1% to 0.5%
- Implement streaming responses that make users perceive systems as 40% faster
- Build interactive citation systems that generate training data from every interaction
- Apply chain-of-thought reasoning to improve accuracy by 15-20%
- Create validation patterns that catch errors before users see them
- Know when to strategically reject work to build trust
Introduction
In Chapter 1, we established evaluation frameworks with synthetic data. In Chapter 2, we learned how to convert that data into fine-tuned models. Now comes the critical question: how do you collect real user data to fuel the improvement flywheel?
Most RAG implementations focus exclusively on retrieval and generation while neglecting the infrastructure needed to collect and utilize user feedback. This is a mistake. Without robust feedback mechanisms, you are flying blind—unable to identify which aspects of your system perform well and which need enhancement.
This chapter covers three interconnected topics:
- Feedback Collection: How to design mechanisms that collect 5x more data
- Streaming and Perceived Performance: How to maintain engagement during processing
- Quality of Life Improvements: Citations, chain of thought, and validation patterns
Each topic reinforces the others. Streaming keeps users engaged long enough to provide feedback. Citations create natural touchpoints for feedback collection. Validation ensures the feedback you collect reflects actual system quality, not random errors.
For Product Managers
This chapter establishes the user experience foundation for continuous improvement. Focus on the business impact of feedback collection rates, the ROI of streaming implementation, and how citations build trust. The technical implementation details matter less than understanding what each technique enables.
For Engineers
This chapter provides practical implementation patterns you will use daily. Pay close attention to the streaming code examples, citation formats, and validation patterns. These techniques directly impact both user experience and your ability to collect training data.
Core Content
Feedback Collection: Building Your Improvement Flywheel
The first principle of effective feedback collection is visibility. Your feedback mechanisms should be prominent and engaging, not hidden in dropdown menus or settings pages.
For Product Managers
Why feedback collection matters: Every piece of user feedback is potential training data. At 0.1% feedback rate, a system with 10,000 daily queries generates 10 labeled examples per day. At 0.5%, that same system generates 50 examples—enough to fine-tune models 5x faster.
Real numbers from production systems:
- Zapier increased feedback from 10 to 40+ submissions per day with better copy
- 90% of follow-up emails accepted without edits when using structured feedback
- 35% reduction in escalation rates when feedback gets specific
- 5x more feedback with enterprise Slack integrations
The copy that works:
| Bad Copy | Good Copy |
|---|---|
| "How did we do?" | "Did we answer your question?" |
| "Rate your experience" | "Did this code solve your problem?" |
| "Give feedback" | "Did we take the correct actions?" |
The key is focusing on your core value proposition rather than generic satisfaction.
For Engineers
Implementing high-visibility feedback:
from pydantic import BaseModel
from enum import Enum
from typing import Optional
from datetime import datetime
class FeedbackType(str, Enum):
POSITIVE = "positive"
NEGATIVE = "negative"
PARTIAL = "partial"
class NegativeFeedbackReason(str, Enum):
TOO_SLOW = "too_slow"
WRONG_INFORMATION = "wrong_information"
BAD_FORMAT = "bad_format"
MISSING_INFORMATION = "missing_information"
IRRELEVANT_SOURCES = "irrelevant_sources"
class UserFeedback(BaseModel):
query_id: str
feedback_type: FeedbackType
negative_reason: Optional[NegativeFeedbackReason] = None
free_text: Optional[str] = None
timestamp: datetime
user_id: Optional[str] = None
async def collect_feedback(
query_id: str,
feedback_type: FeedbackType,
negative_reason: Optional[NegativeFeedbackReason] = None
) -> UserFeedback:
"""
Collect structured feedback with optional follow-up.
When feedback is negative, prompt for specific reason
using checkboxes rather than free text.
"""
feedback = UserFeedback(
query_id=query_id,
feedback_type=feedback_type,
negative_reason=negative_reason,
timestamp=datetime.now()
)
# Store for analysis and training
await store_feedback(feedback)
# For enterprise: post to Slack for visibility
if feedback_type == FeedbackType.NEGATIVE:
await post_to_slack_channel(feedback)
return feedback
Key implementation details:
- Make buttons large and prominent (not hidden in corners)
- Ask follow-up questions only after negative feedback
- Use checkboxes for common issues rather than free text
- Log the query, retrieved documents, and user response together
Mining Implicit Feedback
While explicit feedback (ratings, comments) is valuable, users express opinions through their actions even when they do not provide direct feedback.
For Product Managers
Implicit signals to track:
| Signal | What It Indicates | Training Value |
|---|---|---|
| Query refinements | Initial response was inadequate | Negative example |
| Session abandonment | User gave up | Strong negative |
| Citation clicks | User found source relevant | Positive signal |
| Copy/paste actions | Response was useful | Strong positive |
| Regeneration requests | First response failed | Negative example |
| Workflow activation | System worked correctly | Strong positive |
The dating app insight: Dating apps like Tinder and Hinge train excellent embedding models because they have high volume, clear binary signals (swipe right/left), and simple objectives (match prediction). Design your RAG interactions to generate training labels naturally in the same way.
For Engineers
Mining hard negatives from user behavior:
from typing import List
from pydantic import BaseModel
class HardNegativeCandidate(BaseModel):
query: str
document_id: str
signal_type: str # "citation_deleted", "query_refined", "regenerated"
confidence: float
async def mine_hard_negatives(
session_id: str
) -> List[HardNegativeCandidate]:
"""
Extract hard negative training examples from user behavior.
Hard negatives are documents that appear relevant but
are actually unhelpful—the most valuable training examples
for improving retrieval quality.
"""
session = await get_session(session_id)
candidates = []
# Citation deletions are strong signals
for deleted_citation in session.deleted_citations:
candidates.append(HardNegativeCandidate(
query=session.query,
document_id=deleted_citation.document_id,
signal_type="citation_deleted",
confidence=0.9
))
# Query refinements suggest retrieval failure
if session.refined_queries:
original_docs = session.initial_retrieved_docs
for doc in original_docs:
if doc.id not in session.final_cited_docs:
candidates.append(HardNegativeCandidate(
query=session.query,
document_id=doc.id,
signal_type="query_refined",
confidence=0.7
))
return candidates
UI patterns for hard negative collection:
- Interactive citations: Let users mark citations as irrelevant
- Document filtering: Show top documents, let users remove irrelevant ones
- Regeneration after removal: When users remove a citation and regenerate, that document becomes a hard negative
Enterprise Feedback with Slack Integration
For B2B applications with dedicated customer success teams, Slack integration dramatically increases feedback collection.
For Product Managers
The enterprise feedback pattern:
- Create shared Slack channel with customer stakeholders
- Post negative feedback directly to the channel in real-time
- Allow your team to discuss issues and ask follow-up questions
- Document how feedback is addressed
- Report improvements during regular sync meetings
This approach typically increases feedback by 5x compared to traditional forms while building trust through transparency.
For Engineers
Slack webhook implementation:
import httpx
from typing import Optional
async def post_feedback_to_slack(
feedback: UserFeedback,
webhook_url: str,
channel: Optional[str] = None
):
"""
Post negative feedback to Slack for immediate visibility.
"""
if feedback.feedback_type != FeedbackType.NEGATIVE:
return
# Get context for the feedback
query_context = await get_query_context(feedback.query_id)
message = {
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "Negative Feedback Alert"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": f"*User:* {feedback.user_id or 'Anonymous'}"
},
{
"type": "mrkdwn",
"text": f"*Reason:* {feedback.negative_reason or 'Not specified'}"
}
]
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*Query:* {query_context.query}"
}
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {"type": "plain_text", "text": "View Full Context"},
"url": f"https://app.example.com/queries/{feedback.query_id}"
}
]
}
]
}
async with httpx.AsyncClient() as client:
await client.post(webhook_url, json=message)
Streaming: The Ultimate Progress Indicator
Streaming transforms the user experience from a binary "waiting/complete" pattern to a continuous flow. Users can start reading while the system continues generating.
For Product Managers
Why streaming matters:
- Users perceive animated progress bars as 11% faster even with identical wait times
- Users will tolerate up to 8 seconds of waiting with visual feedback
- Applications with engaging loading screens report higher satisfaction scores
- Streaming increases feedback collection rates by 30-40%
The implementation timing decision: If you are uncertain about implementing streaming, do it early. Migrating from non-streaming to streaming is significantly more complex than building with streaming from the start. Retrofitting can add weeks to your development cycle.
Only about 20% of companies implement streaming well—but the ones that do see massive UX improvements.
For Engineers
Basic streaming implementation:
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import asyncio
import json
app = FastAPI()
@app.post("/query/stream")
async def stream_query_response(request: Request):
"""
Stream a response with interstitials, answer, and citations.
"""
data = await request.json()
query = data.get("query")
async def event_generator():
# Stream interstitials during retrieval
yield f"data: {json.dumps({'type': 'status', 'message': 'Searching documents...'})}\n\n"
documents = await retrieve_documents(query)
yield f"data: {json.dumps({'type': 'status', 'message': f'Found {len(documents)} relevant sources'})}\n\n"
# Stream the answer token by token
async for chunk in generate_answer_stream(query, documents):
yield f"data: {json.dumps({'type': 'answer', 'content': chunk})}\n\n"
await asyncio.sleep(0.01)
# Stream citations after answer
citations = extract_citations(documents)
for citation in citations:
yield f"data: {json.dumps({'type': 'citation', 'data': citation})}\n\n"
# Stream follow-up questions
followups = await generate_followups(query)
yield f"data: {json.dumps({'type': 'followups', 'questions': followups})}\n\n"
yield f"data: {json.dumps({'type': 'done'})}\n\n"
return StreamingResponse(
event_generator(),
media_type="text/event-stream"
)
What to stream:
- Interstitials explaining what is happening
- Answer tokens as they generate
- Citations as separate structured data
- Follow-up questions
- Function call arguments (for agentic systems)
Meaningful Interstitials
Generic loading indicators waste an opportunity. Meaningful interstitials build trust by showing users what is happening.
For Product Managers
Generic vs meaningful interstitials:
| Generic (Bad) | Meaningful (Good) |
|---|---|
| "Loading..." | "Searching 382,549 documents in our knowledge base..." |
| "Please wait" | "Finding relevant precedent cases from 2021-2022..." |
| "Processing" | "Analyzing 3 legal frameworks that might apply..." |
Meaningful interstitials can make perceived wait times up to 40% shorter than actual wait times.
For Engineers
Domain-specific interstitials:
def get_interstitials(query_category: str) -> list[str]:
"""
Return domain-specific interstitial messages.
"""
interstitials = {
"technical": [
"Scanning documentation and code repositories...",
"Identifying relevant code examples and patterns...",
"Analyzing technical specifications...",
],
"legal": [
"Searching legal databases and precedents...",
"Reviewing relevant case law and statutes...",
"Analyzing jurisdictional applicability...",
],
"medical": [
"Consulting medical literature and guidelines...",
"Reviewing clinical studies and research papers...",
"Analyzing treatment protocols...",
],
}
return interstitials.get(query_category, [
"Processing your query...",
"Searching for relevant information...",
"Analyzing related documents..."
])
Skeleton Screens
Skeleton screens are placeholder UI elements that mimic the structure of content while it loads. They create the impression that content is almost ready.
For Product Managers
Facebook's research: Skeleton screens significantly reduced perceived load times, resulting in better user retention and engagement. Users reported that the experience "felt faster" even when actual load times were identical to spinner-based approaches.
Skeleton screens work because they:
- Set clear expectations about what content is loading
- Provide a sense of progress without requiring actual progress data
- Create the impression that the system is actively working
- Give users visual stimulation during the waiting period
For Engineers
For RAG applications, skeleton screens can show:
- The structure of the answer before content loads
- Citation placeholders that will be filled
- Follow-up question button outlines
- Tool usage summaries that will appear
Platform-Specific Streaming: Slack Bots
Slack does not support true streaming, but you can create the illusion of progress through careful interaction design.
For Engineers
Slack bot pattern:
- React with eyes emoji immediately to acknowledge receipt
- Use threaded updates to show progress
- Mark completion with checkmark emoji
- Pre-fill feedback reactions (thumbs up, thumbs down, star)
from slack_sdk.web.async_client import AsyncWebClient
async def handle_slack_message(
client: AsyncWebClient,
channel: str,
thread_ts: str,
query: str
):
# Acknowledge receipt immediately
await client.reactions_add(
channel=channel,
timestamp=thread_ts,
name="eyes"
)
# Post progress update
progress_msg = await client.chat_postMessage(
channel=channel,
thread_ts=thread_ts,
text="Searching knowledge base..."
)
# Process query
response = await process_query(query)
# Update with final response
await client.chat_update(
channel=channel,
ts=progress_msg["ts"],
text=response.answer
)
# Mark as complete
await client.reactions_add(
channel=channel,
timestamp=thread_ts,
name="white_check_mark"
)
# Pre-fill feedback reactions
for emoji in ["thumbsup", "thumbsdown", "star"]:
await client.reactions_add(
channel=channel,
timestamp=progress_msg["ts"],
name=emoji
)
Pre-filling emoji reactions increases feedback collection by up to 5x compared to no reactions.
Citations: Building Trust and Collecting Feedback
Citations serve multiple purposes: they build trust, provide transparency, and create opportunities for feedback collection.
For Product Managers
Why citations matter:
- Users want to know where information comes from
- Citations show what data is being used to generate responses
- Interactive citations create opportunities for document-level relevance signals
Real results from a legal research team:
- 50,000+ labeled examples collected for fine-tuning
- User satisfaction increased from 67% to 89%
- Citation accuracy improved from 73% to 91% through feedback loops
- Attorney trust scores increased by 45%
For Engineers
XML-based citation pattern (most reliable):
According to the contract, <cite id="doc123" start="450" end="467">
the termination clause requires 30 days notice</cite> and
<cite id="doc124" start="122" end="134">includes a penalty
fee of $10,000</cite>.
Benefits of XML citations:
- Survives markdown parsing
- Enables precise highlighting
- Works well with fine-tuning
- Handles abbreviations and technical language
Implementation:
from pydantic import BaseModel
from typing import List
import re
class Citation(BaseModel):
id: str
document_id: str
start_char: int
end_char: int
cited_text: str
def extract_citations(response: str) -> List[Citation]:
"""
Extract citations from XML-tagged response.
"""
pattern = r'<cite id="([^"]+)" start="(\d+)" end="(\d+)">([^<]+)</cite>'
citations = []
for match in re.finditer(pattern, response):
citations.append(Citation(
id=match.group(1),
document_id=match.group(1).split("_")[0],
start_char=int(match.group(2)),
end_char=int(match.group(3)),
cited_text=match.group(4)
))
return citations
async def validate_citations(
citations: List[Citation],
documents: dict
) -> List[Citation]:
"""
Validate that cited text exists in source documents.
"""
valid_citations = []
for citation in citations:
doc = documents.get(citation.document_id)
if doc and citation.cited_text in doc.content:
valid_citations.append(citation)
return valid_citations
Fine-tuning for citation accuracy:
- Train on 10,000+ examples of correct citations
- Focus on common failure modes (wrong chunk, hallucinated citations)
- Always validate citations against source documents before display
Chain of Thought: Making Thinking Visible
Chain-of-thought prompting—asking the model to reason step by step before providing its final answer—typically provides a 10-15% performance improvement for classification and reasoning tasks.
For Product Managers
Why chain of thought matters:
- Improves accuracy by 10-20% on complex reasoning tasks
- Makes AI decision-making transparent to users
- Creates natural loading interstitials during streaming
- Builds trust by showing how conclusions were reached
With models like Claude and GPT-4, chain of thought has become standard practice. Even without reasoning models like o1, implementing chain of thought in business-relevant ways is consistently one of the highest-impact changes.
For Engineers
Chain of thought prompt structure:
def chain_of_thought_prompt(query: str, documents: list) -> str:
"""
Create a prompt that encourages step-by-step reasoning.
"""
context = "\n\n".join([f"DOCUMENT: {doc.content}" for doc in documents])
return f"""
Answer the user's question based on the provided documents.
First, think step by step about how to answer using the documents.
Then provide your final answer.
Structure your response like this:
<thinking>
Your step-by-step reasoning process here...
</thinking>
<answer>
Your final answer here, with citations to specific documents...
</answer>
USER QUESTION: {query}
DOCUMENTS:
{context}
"""
Streaming chain of thought as interstitial:
The thinking section can be streamed as a separate UI component, turning waiting time into a transparent window into how the system works through the problem.
Monologues: Solving Context Management
When dealing with long contexts, language models often struggle with recall and processing all instructions. Monologuing—having the model explicitly reiterate key information before generating a response—improves reasoning without complex architectural changes.
For Product Managers
When monologues help:
- Long documents where relevant information is scattered
- Complex queries requiring synthesis from multiple sources
- Tasks with many constraints or requirements
Case study: SaaS pricing quotes
A company needed to generate pricing quotes from sales call transcripts and a 15-page pricing document. Initial approach: provide both as context. Result: inconsistent quotes that missed key information.
With monologue approach:
- Model first reiterates variables that determine pricing
- Then identifies relevant parts of transcript
- Then determines which pricing tiers apply
- Finally generates the quote
Result: Quote accuracy improved from 62% to 94%. 90% of follow-up emails were accepted without edits.
For Engineers
Monologue prompt structure:
def monologue_prompt(query: str, documents: list, pricing_data: str) -> str:
"""
Create a prompt that encourages information reiteration.
"""
context = "\n\n".join([f"TRANSCRIPT: {doc.content}" for doc in documents])
return f"""
Generate a pricing quote based on the call transcript and pricing documentation.
First, reiterate the key variables that determine pricing options.
Then, identify specific parts of the transcript that relate to these variables.
Next, determine which pricing options from the documentation are most relevant.
Finally, provide a recommended pricing quote with justification.
QUESTION: {query}
TRANSCRIPT:
{context}
PRICING DOCUMENTATION:
{pricing_data}
MONOLOGUE AND ANSWER:
"""
Monologues often replace complex agent architectures. Rather than building multi-stage processes, you can achieve similar results with a single well-constructed monologue prompt.
Validation Patterns: Catching Errors Before Users
Validation patterns act as safety nets for your RAG system. For latency-insensitive applications, validators can significantly increase trust and satisfaction.
For Product Managers
When to use validators:
- High-stakes domains where errors have significant consequences
- Applications where users make important decisions based on output
- Scenarios where specific constraints must be enforced
- Cases where you need to increase user trust
Real example: A marketing team built a system to generate personalized emails with links to case studies. About 4% of emails contained invalid URLs. After implementing URL validation with one retry, the error rate dropped to 0%. After fine-tuning on the corrections, the base error rate dropped to nearly zero—the model learned from its corrections.
For Engineers
URL validation example:
import re
from urllib.parse import urlparse
import httpx
async def validate_urls_in_email(
email_body: str,
allowed_domains: list[str]
) -> tuple[bool, list[str]]:
"""
Validate that all URLs are valid and from allowed domains.
"""
url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
urls = re.findall(url_pattern, email_body)
issues = []
for url in urls:
domain = urlparse(url).netloc
if domain not in allowed_domains:
issues.append(f"URL {url} contains disallowed domain {domain}")
continue
try:
async with httpx.AsyncClient() as client:
response = await client.head(url, timeout=3)
if response.status_code != 200:
issues.append(f"URL {url} returned status {response.status_code}")
except Exception as e:
issues.append(f"URL {url} failed to connect: {str(e)}")
return len(issues) == 0, issues
async def regenerate_if_invalid(
query: str,
initial_response: str,
allowed_domains: list[str]
) -> str:
"""
Validate and regenerate if URLs are problematic.
"""
is_valid, issues = await validate_urls_in_email(
initial_response, allowed_domains
)
if is_valid:
return initial_response
# Regenerate with specific guidance
issues_text = "\n".join(issues)
regeneration_prompt = f"""
The previously generated response contained URL issues:
{issues_text}
Please regenerate, either:
1. Removing problematic URLs entirely, or
2. Replacing them with valid URLs from: {', '.join(allowed_domains)}
Original request: {query}
"""
return await generate_response(regeneration_prompt)
Key insight: Validation both catches errors and creates training data. Each correction becomes a learning opportunity, gradually reducing the need for validation.
Strategic Rejection of Work
One of the most overlooked strategies for improving reliability is knowing when to reject work. Rather than delaying deployment until all edge cases are solved, implement strategic rejection for scenarios where your system is not yet strong enough.
For Product Managers
Why strategic rejection builds trust:
- Acknowledging limitations transparently builds confidence
- Users prefer "I don't know" to confidently wrong answers
- Rejection collects data about what users need
- Allows deployment sooner while collecting data to improve
Example rejection message:
"I notice you're asking about cross-jurisdictional implications of regulation X. Currently, I'm not confident in my ability to analyze multi-jurisdictional regulatory conflicts accurately. Would you like me to instead focus on the requirements within your primary jurisdiction, or connect you with a regulatory specialist?"
For Engineers
Implementing strategic rejection:
async def should_reject_query(
query: str,
confidence_threshold: float = 0.85
) -> tuple[bool, str | None]:
"""
Determine if a query should be politely rejected.
"""
query_category = await classify_query(query)
query_complexity = await assess_complexity(query)
expected_confidence = await predict_confidence(
query, query_category, query_complexity
)
if expected_confidence < confidence_threshold:
reason = (
f"This appears to be a {query_category} question with "
f"{query_complexity} complexity. Based on similar questions, "
f"our confidence is {expected_confidence:.0%}, which is below "
f"our threshold of {confidence_threshold:.0%}."
)
return True, reason
return False, None
Design rejection with precision-recall tradeoffs in mind—avoid rejecting questions you can actually answer well.
Showcasing Capabilities
While RAG systems can theoretically answer a wide range of questions, most excel at particular types. Explicitly highlighting what your system does well guides users toward successful interactions.
For Product Managers
Prompting the user, not just the model:
- Show suggested query types that leverage your strengths
- Create UI elements that highlight special capabilities
- Provide examples of successful interactions
- Use white space to showcase specialized capabilities
Perplexity provides a good example: their interface shows different capabilities (web search, academic papers, math equations) with specific UI elements, guiding users toward interactions that will be successful.
For Engineers
Implement capability showcasing through:
- Dynamic suggestion generation based on system strengths
- UI components that visually distinguish different capabilities
- Example queries that demonstrate successful patterns
- Clear labeling of experimental vs production-ready features
Case Study Deep Dive
Zapier Central: 4x Feedback Improvement
Zapier Central faced a common challenge: limited feedback despite active user engagement. Their feedback submission rates were around 10 per day, almost exclusively negative from frustrated users experiencing errors.
For Product Managers
The change: Instead of tiny, muted feedback buttons in the corner, they added a natural-looking chat message at the end of workflow tests asking: "Did this run do what you expected it to do?"
The results:
- Feedback submissions increased from 10 to 40 per day (4x improvement)
- Started receiving substantial positive feedback (previously almost non-existent)
- Built evaluation suite from 23 to 383 evaluations based on real interactions
- Could make informed decisions about model upgrades
Why it worked:
- Positioning: Request appeared as natural part of conversation
- Timing: Asked immediately after interaction while context was fresh
- Specificity: "Did this do what you expected?" is clearer than "How did we do?"
- Visibility: Larger buttons made the action obvious
For Engineers
Implementation details:
- Built internal feedback triaging system where all submissions land
- Implemented "labeling parties"—weekly team meetings to categorize feedback
- Added extensive metadata (tools used, context, entry point)
- Created tooling to easily convert feedback into formal evaluations
Mining implicit feedback:
- Workflow activation signals (user tests then activates = positive)
- Tool call validation errors (likely LLM mistake = negative)
- Follow-up message analysis (rephrasing = previous response inadequate)
- Hallucination detection (pattern matching for common hallucination phrases)
Legal Research Team: 50,000+ Labeled Examples
A legal research team implemented interactive citations for their in-house attorneys. Each response included citations linked to specific case law or statutes.
For Product Managers
The approach:
- Attorneys could click citations to see full context
- Could mark citations as relevant or irrelevant
- When marked irrelevant, system would regenerate without that source
The results:
- 50,000+ labeled examples collected for fine-tuning
- User satisfaction: 67% to 89% (+22 percentage points)
- Citation accuracy: 73% to 91% through feedback loops
- Attorney trust scores increased by 45%
For Engineers
Technical implementation:
- XML-based citation format with chunk IDs and text spans
- Validation layer verifying cited text exists in referenced chunks
- Fine-tuning on citation-specific tasks reduced errors from 4% to 0.1%
- Special handling for legal abbreviations and technical language
Implementation Guide
Quick Start for PMs
Week 1: Audit Current Feedback
- Measure current feedback collection rate
- Review feedback copy—is it specific to your value proposition?
- Identify where feedback buttons are hidden
- List implicit signals you could be tracking
Week 2: Implement Quick Wins
- Change feedback copy to be specific ("Did we answer your question?")
- Make feedback buttons larger and more prominent
- Add follow-up questions for negative feedback
- Set up basic logging of user interactions
Week 3: Plan Streaming and Citations
- Assess current latency and user abandonment rates
- Prioritize streaming implementation if not already in place
- Design citation format appropriate for your domain
- Plan validation patterns for high-stakes outputs
Detailed Implementation for Engineers
Phase 1: Feedback Infrastructure (1-2 weeks)
# 1. Define feedback schema
class FeedbackEvent(BaseModel):
query_id: str
session_id: str
feedback_type: FeedbackType
negative_reason: Optional[NegativeFeedbackReason]
timestamp: datetime
metadata: dict # tools used, entry point, etc.
# 2. Set up storage
async def store_feedback(event: FeedbackEvent):
# Store in database for analysis
await db.feedback.insert(event.dict())
# Post to Slack for enterprise customers
if event.feedback_type == FeedbackType.NEGATIVE:
await post_to_slack(event)
# 3. Implement implicit signal tracking
async def track_implicit_signals(session: Session):
signals = []
if session.query_refined:
signals.append(("query_refined", session.original_query))
if session.regenerated:
signals.append(("regenerated", session.original_response))
for deleted in session.deleted_citations:
signals.append(("citation_deleted", deleted.document_id))
return signals
Phase 2: Streaming Implementation (2-3 weeks)
# 1. Backend streaming endpoint
@app.post("/query/stream")
async def stream_response(request: QueryRequest):
async def generate():
# Stream interstitials
yield sse_event("status", "Searching documents...")
docs = await retrieve(request.query)
yield sse_event("status", f"Found {len(docs)} sources")
# Stream answer
async for chunk in generate_answer(request.query, docs):
yield sse_event("answer", chunk)
# Stream citations
for citation in extract_citations(docs):
yield sse_event("citation", citation)
yield sse_event("done", None)
return StreamingResponse(generate(), media_type="text/event-stream")
# 2. Frontend handling (React example)
# const eventSource = new EventSource('/query/stream');
# eventSource.onmessage = (event) => {
# const data = JSON.parse(event.data);
# switch(data.type) {
# case 'status': setStatus(data.message); break;
# case 'answer': setAnswer(prev => prev + data.content); break;
# case 'citation': setCitations(prev => [...prev, data.data]); break;
# }
# };
Phase 3: Quality of Life (1-2 weeks)
# 1. Citation validation
async def validate_and_filter_citations(
response: str,
documents: dict
) -> str:
citations = extract_citations(response)
valid = await validate_citations(citations, documents)
if len(valid) < len(citations):
# Log invalid citations for analysis
await log_invalid_citations(citations, valid)
return response # Or regenerate if too many invalid
# 2. Chain of thought wrapper
async def generate_with_cot(query: str, documents: list) -> Response:
prompt = chain_of_thought_prompt(query, documents)
raw_response = await generate(prompt)
# Parse thinking and answer sections
thinking = extract_section(raw_response, "thinking")
answer = extract_section(raw_response, "answer")
return Response(
thinking=thinking, # Can be shown as expandable section
answer=answer,
citations=extract_citations(answer)
)
# 3. Validation layer
async def generate_with_validation(
query: str,
validators: list[Validator]
) -> str:
response = await generate(query)
for validator in validators:
is_valid, issues = await validator.validate(response)
if not is_valid:
response = await regenerate_with_feedback(query, issues)
return response
Common Pitfalls
PM Pitfalls
PM Pitfall: Generic Feedback Copy
The mistake: Using vague questions like "How did we do?" or "Rate your experience."
Why it fails: Users do not know what aspect to evaluate. Responses are vague and uncorrelated with actual system performance.
The fix: Use specific questions aligned with your value proposition. "Did we answer your question?" for Q&A systems. "Did we take the correct actions?" for agentic systems.
PM Pitfall: Hidden Feedback Mechanisms
The mistake: Placing feedback buttons in corners or dropdown menus.
Why it fails: Users will not find them. You collect 0.1% feedback instead of 0.5%.
The fix: Make feedback impossible to miss. Place it directly after responses. Use large, prominent buttons.
PM Pitfall: Ignoring Implicit Signals
The mistake: Only tracking explicit thumbs up/down feedback.
Why it fails: You miss 90%+ of user signals. Query refinements, abandonment, and citation interactions are valuable data.
The fix: Track all user behaviors that indicate satisfaction or dissatisfaction.
Engineering Pitfalls
Engineering Pitfall: Retrofitting Streaming
The mistake: Building without streaming, planning to add it later.
Why it fails: Migrating from non-streaming to streaming is significantly more complex than building with streaming from the start. Can add weeks to development.
The fix: Implement streaming from day one, even if basic.
Engineering Pitfall: Unvalidated Citations
The mistake: Displaying citations without verifying they exist in source documents.
Why it fails: Hallucinated citations destroy trust. Users will stop believing any citations.
The fix: Always validate that cited text exists in referenced documents before display.
Engineering Pitfall: No Feedback Context
The mistake: Storing feedback without the query, retrieved documents, and response.
Why it fails: You cannot analyze why feedback was negative or use it for training.
The fix: Log complete context with every feedback event.
Related Content
Talks
How Zapier 4x'd Their AI Feedback Collection (Vitor)
Key insights:
- Positioning, visibility, and wording of feedback requests dramatically impacts response rates
- Mining implicit feedback from workflow activations, validation errors, and follow-up messages
- "Labeling parties" for team-wide feedback analysis
Why Your AI Is Failing in Production (Ben & Sidhant)
Key insights:
- Traditional error monitoring does not work for AI—there is no exception when something goes wrong
- The Trellis framework for organizing AI outputs into controllable segments
- Implicit signals (user frustration, task failures) vs explicit signals (ratings, regenerations)
Office Hours
Cohort 2, Week 3: Negative feedback handling, feedback lifecycle management, citation and UX best practices
Cohort 3, Week 3: Re-ranking models, user feedback integration, compute allocation strategies
Action Items
For Product Teams
- Audit current feedback collection - Measure your current rate and identify quick wins
- Rewrite feedback copy - Make it specific to your value proposition
- Plan enterprise feedback loops - Consider Slack integration for B2B customers
- Define implicit signals to track - Query refinements, abandonment, citation interactions
- Establish feedback-driven roadmap process - Regular review cycles with engineering
For Engineering Teams
- Implement streaming - If not already in place, prioritize this
- Add meaningful interstitials - Replace generic loading with domain-specific messages
- Build citation validation - Never display unvalidated citations
- Set up feedback logging - Store complete context with every feedback event
- Create hard negative mining pipeline - Extract training data from user behavior
Reflection Questions
-
What is your current feedback collection rate? What would 5x more feedback enable?
-
How visible are your feedback mechanisms? Could a new user find them in 5 seconds?
-
What implicit signals are you not currently tracking that could provide training data?
-
If you implemented streaming tomorrow, how would it change user experience?
-
What validation patterns would catch the most common errors in your system?
Summary
Key Takeaways for Product Managers
- Feedback copy matters more than UI - "Did we answer your question?" beats "How did we do?" by 5x
- Streaming is table stakes - Only 20% of companies do it well, but it dramatically improves UX
- Citations build trust and collect data - Interactive citations can generate 50,000+ labeled examples
- Strategic rejection builds confidence - "I don't know" is better than confidently wrong
Key Takeaways for Engineers
- Implement streaming from day one - Retrofitting adds weeks to development
- Track implicit signals - Query refinements, citation deletions, and regenerations are valuable training data
- Validate citations before display - Hallucinated citations destroy trust
- Use chain of thought for complex reasoning - 10-20% accuracy improvement with minimal effort
- Build validation layers - Catch errors before users see them, create training data from corrections
Further Reading
-
Nielsen Norman Group, "Progress Indicators Make a Slow System Less Insufferable"
-
Facebook Engineering, "Building Skeleton Screens"
-
OpenAI Documentation, "Streaming API Best Practices"
-
Anthropic, "Constitutional AI: Harmlessness from AI Feedback"
Navigation
- Previous: Chapter 2: Training Data and Fine-Tuning - Converting evaluations into training data
- Next: Chapter 4: Query Understanding and Prioritization - Finding patterns in user data
- Reference: Glossary | Quick Reference
- Book Index: Book Overview