Overcoming Latency: Streaming and Interstitials
Key Insight
Perceived performance beats actual performance—users will wait 8 seconds with progress bars but abandon after 3 seconds of silence. Streaming isn't just about showing text faster. It's about maintaining user engagement through the entire retrieval-generation pipeline. Implement streaming early because retrofitting it later adds weeks to your development cycle.
Introduction
RAG applications face a fundamental challenge: the processes involved—retrieval, generation, validation, citation lookup—take time. Even accurate answers lose value if users get frustrated waiting for them.
Perceived performance often matters more than actual performance. Users perceive responsive systems as faster even when the total completion time is identical. This chapter covers practical approaches to address this challenge.
Understanding the Perception Gap: Perceived wait times can be up to 25% longer than actual wait times when users have no visibility into system progress. Showing meaningful progress can make perceived wait times up to 40% shorter.
"Streaming has become table stakes in modern LLM applications. Users expect responses instantly, and implementing streaming significantly improves both actual and perceived performance. Only about 20% of companies I work with have a good understanding of how to implement streaming effectively."
We'll explore two complementary approaches to addressing latency:
- Streaming responses to show progress and deliver content incrementally
- Designing meaningful interstitials that engage users while processing occurs
These techniques not only improve user experience but also lead to higher engagement and more feedback collection, strengthening the improvement flywheel we established in the previous chapter.
Implementation Timing: If you're on the fence about implementing streaming in your RAG application, do it early. Migrating from a non-streaming to a streaming application is significantly more complex than building with streaming from the start. It can add weeks to your development cycle if attempted later in the project lifecycle.
Impact of Visual Feedback
- Users perceive animated progress bars as 11% faster even when wait times are identical
- Users will tolerate up to 8 seconds of waiting when given visual feedback, reducing abandonment rates
- Applications with engaging loading screens report higher satisfaction scores
- Facebook discovered that skeleton screens significantly reduced perceived load times, resulting in better user retention and engagement
The strategies we'll cover in this chapter are becoming essential components of modern LLM applications. By the end of this chapter, you'll understand how to turn waiting time from a point of frustration to an opportunity for engagement and trust-building.
Animation and Perceived Performance
Before diving into streaming implementations, let's understand why animated indicators are so effective at improving perceived performance. Research in cognitive psychology reveals that humans perceive time differently when observing movement.
Research on Progress Indicators: Nielsen Norman Group found that users reported 15-20% faster perceived load time when shown an animated progress indicator compared to a static wait screen, with identical actual load times.
Animated indicators work by:
- Giving users confidence that the system is actively working
- Drawing attention away from the passage of time
- Setting expectations about progress and completion
The most effective indicators for RAG systems are those that convey meaningful information about what's happening behind the scenes, not just generic loading animations.
Consider how differently users perceive these three waiting experiences:
- A static screen with no feedback
- A generic spinning wheel
- A step-by-step indicator showing "Searching relevant documents (2/5 complete)..."
The third approach not only feels faster but also builds trust by providing transparency into the process.
Streaming Responses: The Ultimate Progress Indicator
Streaming takes the concept of progress indicators to its logical conclusion by delivering content to users as it's generated, rather than waiting for the entire response to complete. This creates a much better user experience by:
- Showing immediate activity, reducing uncertainty
- Providing useful content while generation continues
- Allowing users to begin reading before the full response is ready
In a traditional RAG implementation, users submit a query and wait in silence until the full response appears. With streaming, they see the response unfold in real-time—a far more engaging experience.
When to Implement Streaming
My recommendation is to stream everything when possible. You can:
- Stream interstitials to explain latency and help users understand what's happening
- Stream different results and UI components so users don't have to wait for completion
- Stream tool calls and function arguments to show intermediate states
- Implement skeleton screens (like those used by Facebook, LinkedIn, and Slack) to improve perceived latency
"I've seen companies experience 30-40% higher feedback collection rates after implementing effective streaming compared to traditional 'wait and display' approaches. This creates a cycle where better performance leads to more feedback, which enables more targeted improvements."
```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant Retriever
    participant Generator

    User->>Frontend: Submits query
    Frontend->>Backend: Sends query
    Note over Frontend: Shows "Thinking..." animation
    Backend->>Retriever: Requests relevant documents
    Retriever->>Backend: Returns documents
    Note over Backend: Documents retrieved
    Backend->>Generator: Generates response with documents
    Note over Frontend: Shows "Generating response..."
    loop Streaming
        Generator->>Backend: Streams token chunks
        Backend->>Frontend: Forwards token chunks
        Frontend->>User: Displays incremental response
    end
    Note over Frontend: Full response displayed
```
Streaming changes the user experience from a binary "waiting/complete" pattern to a continuous flow. Users can start reading while the system continues generating.
Technical Implementation of Streaming
Implementing streaming requires coordination across your entire stack:
- A generation endpoint that supports streaming
- Backend routes that maintain open connections
- Frontend components that render incremental updates
Most modern language models and APIs support streaming, though the specific implementation varies. The effort is worth it: in side-by-side comparisons, streaming responses feel far more responsive than a complete answer appearing all at once:
```python
# Example using OpenAI's API for streaming.
# retrieve_documents and prepare_context are placeholders for your own
# retrieval pipeline.
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()


@app.post("/query/stream")
async def stream_query_response(request: Request):
    """
    Stream a response to a user query.

    This endpoint:
    1. Processes the incoming query
    2. Retrieves relevant documents
    3. Streams the generated response
    """
    # Parse the incoming request
    data = await request.json()
    query = data.get("query")

    # Retrieve relevant documents (non-streaming part)
    documents = retrieve_documents(query)
    context = prepare_context(documents)

    # Set up streaming response
    async def event_generator():
        # Create a streaming completion
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Query: {query}\n\nContext: {context}"},
            ],
            stream=True,  # Enable streaming
        )

        # Yield chunks as they arrive, formatted as Server-Sent Events
        async for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
                await asyncio.sleep(0.01)  # Small delay to control flow rate

        yield "data: [DONE]\n\n"

    # Return a streaming response
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )
```
On the frontend, you'll need to handle Server-Sent Events (SSE) or WebSockets to receive and display the streamed content.
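Browser clients typically read this stream with the EventSource API or a fetch reader. Since this chapter's examples are in Python, here's a minimal client sketch for exercising the endpoint above; httpx is an assumed dependency and the URL is illustrative:

```python
# Minimal SSE consumer sketch for testing the /query/stream endpoint above.
# httpx is an assumed dependency; the URL is illustrative.
import asyncio

import httpx


async def consume_sse(query: str):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", "http://localhost:8000/query/stream", json={"query": query}
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    payload = line[len("data: "):]
                    if payload == "[DONE]":
                        break
                    print(payload, end="", flush=True)  # render incrementally


asyncio.run(consume_sse("What is semantic caching?"))
```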
Showing Function Call Arguments
One unique advantage of streaming is the ability to show users not just the final response but also the thinking and processing that led to it. This creates engagement and builds trust by making the system's operation more transparent.
For example, you can stream the function calls and arguments that your RAG system is using.
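With the OpenAI streaming API, tool-call names and their JSON arguments arrive as incremental deltas that you can forward to the UI as they accumulate. Here's a hedged sketch; the search_knowledge_base tool and the model name are illustrative:

```python
# Stream partial tool-call names and arguments to the user as the model
# produces them. The search_knowledge_base tool is hypothetical.
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def stream_tool_calls(query: str):
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
        tools=[{
            "type": "function",
            "function": {
                "name": "search_knowledge_base",
                "description": "Search the document store",
                "parameters": {
                    "type": "object",
                    "properties": {"search_query": {"type": "string"}},
                    "required": ["search_query"],
                },
            },
        }],
        stream=True,
    )
    async for chunk in response:
        delta = chunk.choices[0].delta
        if delta.tool_calls:
            for call in delta.tool_calls:
                if call.function.name:
                    yield f"data: Calling {call.function.name}...\n\n"
                if call.function.arguments:
                    # Arguments arrive as partial JSON fragments
                    yield f"data: {call.function.arguments}\n\n"
```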
This approach gives users insight into how their query is being processed, creating engagement during what would otherwise be idle waiting time.
Streaming Structured Data
Streaming isn't limited to plain text—you can stream structured data like citations, follow-up questions, or data visualizations. This technique is especially valuable for complex RAG applications where responses have multiple components.
Streaming in Modern Applications
Libraries like Instructor and modern LLM frameworks now support streaming structured data. This allows applications to:
- Stream citations with IDs and titles
- Stream different response components in parallel
- Stream function calls and their arguments
- Build dynamic UI that renders each component as it becomes available
Here's how you might implement structured streaming for a response that includes an answer, citations, and follow-up questions:
```python
# Builds on the same FastAPI setup as above. retrieve_documents,
# generate_answer_stream, extract_citations, and generate_followup_questions
# are placeholders for your own pipeline.
import asyncio
import json

from fastapi.responses import StreamingResponse


async def stream_structured_response(query: str):
    """
    Stream a structured response with multiple components.

    Parameters:
    - query: The user's question

    Returns:
    - A streaming response with structured components
    """
    # Retrieve documents (non-streaming)
    documents = retrieve_documents(query)

    # Start streaming response components
    async def generate_stream():
        # Send response type indicator
        yield json.dumps({"type": "start", "components": ["answer", "citations", "followup"]}) + "\n"

        # Stream the answer generation
        answer_chunks = generate_answer_stream(query, documents)
        async for chunk in answer_chunks:
            yield json.dumps({"type": "answer", "content": chunk}) + "\n"
            await asyncio.sleep(0.02)

        # Stream citations after the answer
        citations = extract_citations(documents)
        for citation in citations:
            yield json.dumps({
                "type": "citation",
                "id": citation["id"],
                "title": citation["title"],
                "text": citation["text"][:100] + "...",
                "relevance": citation["relevance"],
            }) + "\n"
            await asyncio.sleep(0.05)

        # Generate and stream follow-up questions
        followups = generate_followup_questions(query, documents)
        yield json.dumps({"type": "followup", "questions": followups}) + "\n"

        # Signal completion
        yield json.dumps({"type": "end"}) + "\n"

    # Newline-delimited JSON stream
    return StreamingResponse(generate_stream(), media_type="application/x-ndjson")
```
On the frontend, you'd handle this structured stream by updating different UI components based on the message type.
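A browser client would do this dispatch in JavaScript; the same logic in Python (again using httpx, with hypothetical render_* UI hooks and an illustrative URL) looks like this:

```python
# Consumer sketch for the structured NDJSON stream above. The render_*
# functions are hypothetical UI hooks; the URL is illustrative.
import json

import httpx


async def consume_structured_stream(query: str):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", "http://localhost:8000/query/structured", json={"query": query}
        ) as response:
            async for line in response.aiter_lines():
                if not line.strip():
                    continue
                message = json.loads(line)
                if message["type"] == "answer":
                    render_answer_chunk(message["content"])
                elif message["type"] == "citation":
                    render_citation(message)
                elif message["type"] == "followup":
                    render_followups(message["questions"])
                elif message["type"] == "end":
                    break
```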
This approach creates a dynamic, engaging experience where different parts of the response appear progressively, keeping users engaged throughout the generation process.
Meaningful Interstitials: Making Waiting Engaging
For situations where some processing must happen before any content can be displayed, well-designed interstitials can turn waiting time from a frustrating experience into an engaging one.
The key principle is to make interstitials meaningful rather than generic. Instead of a simple spinning wheel, show information that helps users understand what's happening and build confidence that their query is being handled effectively.
Skeleton Screens: The Illusion of Progress
Skeleton screens are placeholder UI elements that mimic the structure of content while it loads. Unlike traditional spinners or progress bars, they create the impression that content is almost ready by showing its outline.
Facebook's Research: Facebook's user experience research discovered that skeleton screens significantly reduced perceived load times, resulting in better user retention and engagement. Users reported that the experience "felt faster" even when actual load times were identical to spinner-based approaches.
Skeleton screens work because they:
- Set clear expectations about what content is loading
- Provide a sense of progress without requiring actual progress data
- Create the impression that the system is actively working on the request
- Give users visual stimulation during the waiting period
For RAG applications, skeleton screens can be particularly effective when showing:
- The structure of the answer before content loads
- Citation placeholders that will be filled
- Follow-up question button outlines
- Tool usage summaries that will appear
Meaningful vs. Generic Interstitials
Generic Interstitial: "Loading..."
Meaningful Interstitial:
- "Searching 382,549 documents in our knowledge base..."
- "Finding relevant precedent cases from 2021-2022..."
- "Analyzing 3 legal frameworks that might apply to your question..."
Meaningful interstitials should:
- Be specific about what the system is doing
- Include actual metrics when possible (number of documents, etc.)
- Update dynamically to show progress
- Maintain a confident, authoritative tone
Here's how you might implement meaningful interstitials:
```python
# classify_query, get_repository_count, and get_case_count are placeholders
# for your own helpers.
async def generate_interstitials(query: str):
    """
    Generate meaningful interstitial messages for a query.

    Parameters:
    - query: The user's question

    Returns:
    - A sequence of interstitial messages
    """
    # Analyze the query to determine appropriate interstitials
    category = classify_query(query)

    # Define category-specific interstitials
    interstitials = {
        "technical": [
            "Scanning documentation and code repositories...",
            "Identifying relevant code examples and patterns...",
            "Analyzing technical specifications and requirements...",
        ],
        "legal": [
            "Searching legal databases and precedents...",
            "Reviewing relevant case law and statutes...",
            "Analyzing jurisdictional applicability...",
        ],
        "medical": [
            "Consulting medical literature and guidelines...",
            "Reviewing clinical studies and research papers...",
            "Analyzing treatment protocols and best practices...",
        ],
        # Add other categories as needed
    }

    # Add domain-specific metrics if available
    try:
        # For technical queries, add repository info
        if category == "technical":
            repo_count = get_repository_count()
            interstitials["technical"].append(f"Searching across {repo_count} code repositories...")
        # For legal queries, add document counts
        elif category == "legal":
            case_count = get_case_count()
            interstitials["legal"].append(f"Analyzing {case_count} potentially relevant cases...")
    except Exception:
        # Fall back to generic but still domain-specific messages
        pass

    # Get the relevant list based on category, or use a default
    message_list = interstitials.get(category, [
        "Processing your query...",
        "Searching for relevant information...",
        "Analyzing related documents...",
    ])

    return message_list
```
On the frontend, you'd display these interstitials in sequence during the waiting period.
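On the backend, one way to drive that sequence is to rotate the messages only while the real work is still in progress. A sketch, assuming an async retrieval helper (retrieve_documents_async is hypothetical):

```python
# Rotate interstitial messages over SSE while retrieval runs concurrently.
# retrieve_documents_async is a hypothetical async version of your retriever.
import asyncio
import json


async def interstitial_stream(query: str):
    messages = await generate_interstitials(query)
    retrieval_task = asyncio.create_task(retrieve_documents_async(query))

    i = 0
    while not retrieval_task.done():
        msg = messages[i % len(messages)]
        yield f"data: {json.dumps({'type': 'interstitial', 'text': msg})}\n\n"
        i += 1
        await asyncio.sleep(1.5)  # pace the rotation

    documents = retrieval_task.result()
    yield f"data: {json.dumps({'type': 'status', 'text': 'Generating response...'})}\n\n"
    # ...hand off to the answer-generation stream from earlier in the chapter
```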
Optimizing Actual Performance
While perceived performance is critical, we shouldn't neglect actual performance optimizations. Here are several strategies for reducing real latency in RAG applications:
1. Optimize Your Retrieval Pipeline
The retrieval phase is often the most time-consuming part of a RAG system. Consider these optimizations:
- Use approximate nearest neighbor search instead of exact search for large collections (see the sketch after this list)
- Implement a tiered retrieval approach that filters candidates quickly before precise ranking
- Pre-compute and cache embeddings for your document collection
- Shard your vector database to distribute search across multiple instances
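For the first point, here's what switching from exact to approximate search can look like with FAISS. This is a hedged sketch: it assumes `embeddings` is an (n, d) float32 numpy array of unit-normalized document vectors you've precomputed, and `query_embedding` is a matching (d,) query vector.

```python
# Approximate nearest neighbor search with an IVF index in FAISS.
# Assumes unit-normalized float32 vectors, so inner product = cosine similarity.
import faiss

d = embeddings.shape[1]          # embedding dimension
nlist = 100                      # number of IVF clusters

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

index.train(embeddings)          # learn cluster centroids (one-time cost)
index.add(embeddings)            # add document vectors

index.nprobe = 10                # clusters scanned per query: speed/recall knob
scores, doc_ids = index.search(query_embedding.reshape(1, -1), k=5)
```

Raising `nprobe` improves recall at the cost of latency; tune it against your own evaluation set rather than accepting the defaults.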
2. Implement Caching
Caching significantly improves performance for repeated or similar queries:
- Semantic caching: Cache results based on embedding similarity, not just exact matches
- Fragment caching: Cache individual retrieved documents even if the full query is new
- Result caching: Store complete responses for common queries
Here's a simple sketch of semantic caching.
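It keys the cache on embedding similarity rather than exact string matches; embed_query is an assumed helper that returns a unit-normalized float32 numpy vector:

```python
# Semantic cache sketch: reuse a cached answer when a new query's embedding
# is close enough to a previous one. embed_query is an assumed helper that
# returns a unit-normalized numpy vector.
import numpy as np


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.embeddings:
            return None
        query_emb = embed_query(query)
        # Cosine similarity against every cached query (fine for small caches;
        # use a vector index for large ones)
        sims = np.stack(self.embeddings) @ query_emb
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.responses[best]
        return None

    def put(self, query: str, response: str) -> None:
        self.embeddings.append(embed_query(query))
        self.responses.append(response)
```

The similarity threshold is the key tuning knob: too low and users get stale answers to genuinely different questions, too high and the cache rarely hits.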
3. Implement Progressive Loading
Load different components of your response progressively, with the most important parts first:
- Show the direct answer before loading citations
- Display key findings before detailed explanations
- Show high-confidence sections before speculative ones
4. Optimize Model Usage
Language model inference can be optimized through:
- Quantization: Use 8-bit or 4-bit quantized models where appropriate
- Distillation: Train smaller, faster models for specific query types
- Parallel inference: Process multiple documents or query components simultaneously
- Model selection: Use smaller models for simpler tasks, reserving larger models for complex reasoning (a routing sketch follows this list)
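For the last point, a minimal routing sketch; the thresholds and model names are illustrative, not recommendations:

```python
# Illustrative model router: thresholds and model names are placeholders,
# tune them against your own latency and quality measurements.
def select_model(query: str, num_documents: int) -> str:
    is_simple = len(query.split()) < 20 and num_documents <= 3
    return "gpt-4o-mini" if is_simple else "gpt-4o"
```

The selected model name would then be passed into the streaming completion call shown earlier in the chapter.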
Platform-Specific Implementations
Streaming in Slack Bots
Implementing streaming in a Slack bot environment presents unique challenges and opportunities. While Slack doesn't support true streaming in the same way as a web interface, you can create the illusion of progress and responsiveness through careful interaction design.
Here's a simple but effective approach for Slack bots:
1. Initial Acknowledgment: React with the 👀 emoji immediately when receiving a message to indicate that the bot has seen the request and is processing it.
2. Progress Updates: Use message updates or threading to show progress, such as:
   - Searching through knowledge base...
   - Found 5 relevant documents...
   - Generating response...
3. Completion Indicator: Mark the message with a ✅ emoji when the response is complete.
4. Feedback Collection: Pre-fill emoji reactions (👍 👎 ⭐) to prompt users for feedback on the response quality.
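Here's how those four steps might fit together using slack_sdk. This is a hedged sketch: search_knowledge_base and run_rag_pipeline are placeholders for your own pipeline, and SLACK_BOT_TOKEN is assumed to be set in the environment.

```python
# Pseudo-streaming in Slack with slack_sdk: acknowledge, update a progress
# message in place, then pre-fill feedback reactions. search_knowledge_base
# and run_rag_pipeline are placeholders for your own pipeline.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def handle_message(channel: str, ts: str, query: str):
    # 1. Acknowledge immediately
    client.reactions_add(channel=channel, timestamp=ts, name="eyes")

    # 2. Post a progress message in the thread and update it in place
    progress = client.chat_postMessage(
        channel=channel, thread_ts=ts, text="Searching through knowledge base..."
    )
    documents = search_knowledge_base(query)
    client.chat_update(
        channel=channel,
        ts=progress["ts"],
        text=f"Found {len(documents)} relevant documents...\nGenerating response...",
    )

    # 3. Replace the progress message with the answer and mark completion
    answer = run_rag_pipeline(query, documents)
    client.chat_update(channel=channel, ts=progress["ts"], text=answer)
    client.reactions_add(channel=channel, timestamp=ts, name="white_check_mark")

    # 4. Pre-fill feedback reactions on the answer message
    for name in ("thumbsup", "thumbsdown", "star"):
        client.reactions_add(channel=channel, timestamp=progress["ts"], name=name)
```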
Slack Feedback Collection
By pre-filling emoji reactions (👍 👎 ⭐), you increase the likelihood of receiving user feedback. This approach places feedback options directly in the user's view, rather than requiring them to take additional steps. In testing, this approach increased feedback collection rates by up to 5x compared to text-based feedback prompts.
The Connection Between Streaming, Performance, and Feedback
The techniques discussed in this chapter aren't just about improving user experience—they directly strengthen the feedback collection mechanisms we established in Chapter 3.1.
Research consistently shows that users provide more feedback when systems feel responsive and engaging. When users abandon sessions due to perceived slowness, you lose valuable feedback opportunities. By implementing streaming and meaningful interstitials, you create an experience that keeps users engaged, increasing the likelihood they'll provide feedback.
In our experience, implementations with effective streaming collect 30-40% more feedback compared to traditional "wait and display" approaches. This creates a positive cycle where better performance leads to more feedback, which enables more targeted improvements.
The most successful RAG applications aren't just accurate—they're responsive, engaging, and transparent. By applying the techniques in this chapter, you create an experience that keeps users engaged throughout the interaction, building trust and encouraging the feedback that fuels continuous improvement.
Real-world Impact
"For a customer support RAG application, implementing streaming and feedback-optimized interstitials increased our feedback collection rate from 5.6% to over 25%. This allowed us to fine-tune five times faster and quickly identify the most problematic query types. Within six weeks, we improved customer satisfaction scores by 34% by addressing these specific failure modes."
Conclusion: Performance as Experience Design
Throughout this chapter, we've explored how to overcome latency through a combination of streaming responses, meaningful interstitials, skeleton screens, platform-specific implementations, and technical optimizations. The key insight is that performance isn't just a technical concern—it's a fundamental aspect of experience design that directly impacts your feedback collection rates.
By implementing streaming, you change the user experience from a binary "waiting/complete" pattern to a continuous flow of information. With skeleton screens, you set clear expectations about what content is loading. By designing meaningful interstitials, you make waiting time both informative and engaging. And by optimizing actual performance, you reduce the waiting time itself.
These approaches work in concert to create a responsive, engaging RAG experience that keeps users invested and encourages feedback. Users provide up to 5x more feedback when your application feels responsive and engaging. This creates a strong feedback loop where better performance leads to more feedback, which enables more targeted improvements.
Implementation Priority
If you're at the start of your RAG implementation journey, prioritize streaming first. It's much easier to integrate from the beginning than to retrofit later. Next, focus on meaningful interstitials and skeleton screens. Finally, implement platform-specific optimizations for your particular usage context (web, Slack, mobile, etc.).
In the next chapter, we'll build on these foundations by exploring quality-of-life improvements like interactive citations, chain-of-thought reasoning, and validation patterns. These elements further enhance the user experience while creating additional opportunities for feedback collection.
Reflection Questions
- What aspects of your RAG application's user experience are most affected by latency?
- How could you modify your current interface to show meaningful progress during retrieval and generation?
- What information could you stream incrementally to improve perceived performance?
- Which components of your RAG pipeline are the biggest contributors to actual latency? How might you optimize them?
- How would implementing streaming affect your feedback collection mechanisms?
- Is your feedback collection UI too subtle? How could you improve its visibility and clarity?
- How might you implement skeleton screens in your particular application context?
- If your application runs on platforms like Slack or Teams, what platform-specific techniques could you use to improve perceived latency?
- How could you use interstitials to educate users about your system's capabilities and build trust?
- What metrics would you track to measure the impact of your latency improvements on user satisfaction and feedback collection?
Summary
Latency is a critical challenge in RAG applications that directly impacts both user experience and feedback collection rates. In this chapter, we've explored a comprehensive approach to overcoming latency challenges:
Streaming responses turn waiting into an engaging experience where users see answers unfold in real time, improving perceived performance and user engagement. Data shows that streaming can increase feedback collection rates by 30-40% compared to traditional approaches.
Skeleton screens create the illusion of progress by showing content outlines before the actual content loads. Companies like Facebook have found that skeleton screens significantly reduce perceived load times and improve user retention.
Meaningful interstitials make necessary waiting periods informative and less frustrating by communicating what's happening behind the scenes. Well-designed interstitials can make perceived wait times up to 40% shorter than actual wait times.
Platform-specific implementations like Slack bots with emoji reactions can create pseudo-streaming experiences and increase feedback collection, with pre-filled emoji reactions driving up to 5x more feedback.
These techniques, combined with actual performance optimizations like caching and progressive loading, create RAG applications that feel responsive and trustworthy even when complex processing is occurring. The result is not just better user experience but also significantly more feedback, fueling a continuous improvement cycle.
Remember: If you only implement one improvement from this chapter, make it streaming. It's substantially easier to build streaming from the start than to retrofit it later, and it has the biggest impact on both perceived performance and feedback collection rates.
Additional Resources
- Nielsen Norman Group, "Progress Indicators Make a Slow System Less Insufferable" - Research on how progress indicators affect perceived wait times
- Google Developers, "Measuring Perceived Performance" - Metrics and techniques for measuring how users perceive application performance
- OpenAI Documentation, "Streaming API Best Practices" - Implementation details for streaming with OpenAI models
- GitHub Repository: Streaming-RAG-Implementation - Example implementation of a streaming RAG application
- Facebook Engineering, "Building Skeleton Screens" - Facebook's approach to implementing skeleton screens for improved perceived performance
- Anthropic Structured Outputs Guide - Guide for generating structured data with Claude that can be streamed incrementally
- Slack API Documentation, "Adding Reactions to Messages" - How to programmatically add emoji reactions to messages for feedback collection
- David Maister, "The Psychology of Waiting Lines" - Research on the psychological aspects of waiting
- GitHub Repository: React Skeleton Screens - Open-source library for implementing skeleton screens in React applications