RAG Performance Optimization and Advanced Patterns: GraphRAG and Hybrid Search
by Aaron Dsilva, Founding Engineer
Making Your RAG System Actually Fast (And Smart)
The Hybrid Search Approach That Actually Works
Here's what we learned the hard way: you need both semantic search AND keyword search working together.
Think of it like this:
- Vector search is like having a really smart intern who "gets" what you mean
- Keyword search is like having a detail-oriented colleague who never misses the fine print
- Together they're unstoppable
In practice, this means combining:
- Dense vectors for questions like "How do I cancel my subscription?" (semantic understanding)
- BM25 keyword matching for queries like "What's the API rate limit for the /users endpoint?" (exact technical terms). A minimal sketch of both retrievers follows this list.
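Here's a minimal sketch of the two retrievers side by side. It assumes your corpus is a list of strings and that embed is whatever embedding function your stack provides; the BM25 side uses the rank_bm25 package:

from rank_bm25 import BM25Okapi
import numpy as np

def bm25_rank(query, docs, top_k=10):
    # Keyword side: BM25 over whitespace-tokenized documents
    bm25 = BM25Okapi([doc.split() for doc in docs])
    scores = bm25.get_scores(query.split())
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]

def vector_rank(query, docs, embed, top_k=10):
    # Semantic side: cosine similarity between query and document embeddings
    q = np.asarray(embed(query))
    d = np.asarray([embed(doc) for doc in docs])
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-9)
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:top_k]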
The Reranking Magic
Once you've got results from both approaches, you need to merge them intelligently. The most robust approach we've found uses Reciprocal Rank Fusion (RRF):
Combined_Score = 1/(rank_vector + k) + 1/(rank_keyword + k)

where k is usually set to 60. Simple, effective, and it doesn't require training additional models.
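In code, RRF is only a few lines. This sketch merges two ranked lists of document IDs, best first; the rrf_merge name is ours:

def rrf_merge(vector_ranking, keyword_ranking, k=60):
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Documents ranked highly in either list float to the top
    return sorted(scores, key=scores.get, reverse=True)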
Alternative Reranking Approaches:
- Cross-encoder models like bge-reranker (more accurate but slower; sketched after this list)
- Cohere's rerank API (excellent quality, costs money)
- Custom reranking models trained on your specific domain (highest accuracy, most effort)
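If you go the cross-encoder route, sentence-transformers makes it a few lines. A sketch, assuming candidates have already been retrieved (the model name here is one common choice, not the only one):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair jointly; slower than bi-encoders but more accurate
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]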
The Need for Speed
Performance Optimization Stack
Layer 1: Caching (80% of the wins)
Layer 2: Async processing (15% of the wins)
Layer 3: Hardware optimization (5% of the wins)
Focus on caching first—it's where you'll get the biggest performance improvements for the least effort.
The Smart Caching Strategy:
Query Result Caching:
- Cache identical questions for 24 hours
- Use fuzzy matching for "similar enough" queries
- Store both the retrieved chunks AND the final answer (see the sketch after this list)
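A minimal sketch of that cache. The normalization here is a crude stand-in for real fuzzy matching (embedding similarity over queries is a common upgrade), and get_cached/put_cached are our names:

import hashlib
import time

_CACHE = {}
TTL_SECONDS = 24 * 60 * 60  # cache identical questions for 24 hours

def _key(query):
    # Crude "similar enough" matching: lowercase, drop punctuation, collapse whitespace
    normalized = " ".join(
        "".join(c for c in query.lower() if c.isalnum() or c.isspace()).split()
    )
    return hashlib.sha256(normalized.encode()).hexdigest()

def get_cached(query):
    entry = _CACHE.get(_key(query))
    if entry and time.time() - entry["ts"] < TTL_SECONDS:
        return entry  # carries both the retrieved chunks and the final answer
    return None

def put_cached(query, chunks, answer):
    _CACHE[_key(query)] = {"ts": time.time(), "chunks": chunks, "answer": answer}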
Embedding Caching:
- Never re-embed the same document
- Cache embeddings with document hashes, as sketched below
- Invalidate only when content actually changes
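Hash-keyed caching is a one-function sketch (again, embed stands in for your embedding model):

import hashlib

_embedding_cache = {}

def cached_embed(text, embed):
    # Key by content hash: unchanged text is never re-embedded, and
    # changed text invalidates itself by producing a new hash
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)
    return _embedding_cache[key]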
Chunk Preprocessing:
- Cache parsed and chunked documents
- Store metadata with the chunks
- Update incrementally, not from scratch
Advanced Performance Patterns
Streaming Responses: Don't wait for the entire answer to be generated. Stream the response back to users as it's being created. This makes your system feel 2-3x faster even if the actual processing time is the same.
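With the OpenAI Python client, for example, streaming is a one-flag change (the model name is a placeholder; most LLM SDKs offer the same pattern):

from openai import OpenAI

client = OpenAI()

def stream_answer(prompt):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model your stack runs
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry only role or metadata
            yield delta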
Prefetch Common Queries: Analyze your query logs and pre-compute answers for the most common questions. Store these in a simple key-value cache for instant responses.
Smart Batch Processing: If you're processing multiple questions at once, batch them through your LLM calls. Most APIs support batching, which dramatically reduces overhead.
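One common pattern is firing the calls concurrently with an async client, so total latency is roughly the slowest single call rather than the sum. A sketch (model name again a placeholder; dedicated batch endpoints, where your provider offers them, cut costs further):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def answer_one(question):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

async def answer_batch(questions):
    # All requests run concurrently; results come back in input order
    return await asyncio.gather(*(answer_one(q) for q in questions))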
Monitoring & Evaluation: Catching Problems Before Users Do
The Metrics That Actually Matter
Response Quality Metrics:
- Answer Accuracy - Are we giving correct information?
- Relevance Score - Are the retrieved documents actually related?
- Completeness - Are we answering the full question?
- Citation Quality - Can users trace answers back to sources?
System Performance Metrics:
- Query Latency (P95) - The latency that 95% of queries stay under; your slowest 5% exceed it
- Retrieval Precision - What percentage of retrieved docs are useful?
- Cache Hit Rate - How often are we avoiding expensive recomputation?
- Error Rate - How often does something break?
User Experience Metrics:
- User Satisfaction - Thumbs up/down feedback
- Query Abandonment - Do users give up waiting?
- Follow-up Questions - Are users getting what they need?
- Session Duration - Are users finding value?
The RAGAS Evaluation Framework
RAGAS (Retrieval-Augmented Generation Assessment) is the closest thing we have to a standard for evaluating RAG systems. Here's what it measures:
- Faithfulness: Is the answer grounded in the retrieved documents, without contradicting them?
- Answer Relevancy: Does the answer actually address the question?
- Context Precision: Are the top-ranked retrieved docs relevant?
- Context Recall: Did we retrieve all the relevant information available?
RAGAS Implementation:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One row per question: the generated answer, the retrieved
# contexts, and a reference answer for the recall metrics
dataset = Dataset.from_dict({
    'question': [...],
    'answer': [...],
    'contexts': [...],
    'ground_truths': [...],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
Setting Up Continuous Evaluation:
- Sample 1-5% of production queries for evaluation
- Mix automatic RAGAS scoring with human review
- Set up alerts when scores drop below thresholds
- Track trends to catch gradual degradation (a minimal sampling-and-alerting sketch follows)
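In this sketch, run_ragas and alert are hypothetical wrappers around the evaluate() call above and your alerting system, and the threshold is an assumption to calibrate against your own baseline:

import random

SAMPLE_RATE = 0.02  # evaluate roughly 2% of production traffic
FAITHFULNESS_THRESHOLD = 0.80  # assumed; tune to your baseline

def maybe_evaluate(question, answer, contexts):
    if random.random() > SAMPLE_RATE:
        return
    scores = run_ragas(question, answer, contexts)  # hypothetical RAGAS wrapper
    if scores["faithfulness"] < FAITHFULNESS_THRESHOLD:
        alert(f"RAG faithfulness dropped to {scores['faithfulness']:.2f}")  # hypothetical pager hook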
Production RAG Monitoring Success
Challenge: Continuous monitoring of RAG quality across 10K+ daily queries without manual review overhead.
Solution: RAGAS automated evaluation with human review sampling and trend monitoring.
Results: Reduced manual review effort by 85% while improving response quality through early detection of performance degradation.
GraphRAG: When Traditional RAG Hits a Wall
The Problem Traditional RAG Can't Solve
Traditional RAG works great for direct questions: "What's our refund policy?" or "How do I reset my password?"
But it struggles with questions like:
- "What are the common themes across all customer complaints this quarter?"
- "How do different product features relate to each other?"
- "What patterns emerge from our sales data across regions?"
Why? Because these questions require connecting information across multiple documents and identifying patterns that span your entire corpus.
How GraphRAG Changes the Game
Microsoft's GraphRAG doesn't just retrieve relevant chunks—it builds a knowledge graph from your documents, then uses that graph to answer complex analytical questions.
The GraphRAG Process:
- Entity Extraction: Identify people, places, concepts, and relationships
- Community Detection: Group related entities into logical clusters
- Community Summarization: Create summaries of each cluster
- Query-time Graph Traversal: Use the graph structure to find connected information
Microsoft's Numbers: Research shows GraphRAG has a 70-80% "win rate" vs traditional RAG for complex analytical questions. The trade-off? It's more expensive and complex to implement.
When to Use GraphRAG
Good Candidates:
- Research and analysis use cases
- Legal document review requiring pattern identification
- Business intelligence queries across large document sets
- Investigative journalism connecting disparate sources
- Academic research synthesis and discovery
Skip GraphRAG For:
- Simple Q&A use cases
- Technical documentation lookup
- Customer support with direct answers
- Cost-sensitive implementations
- Small document sets (<1000 documents)
Building Knowledge Connections
Implementation Strategy:
Phase 1: Entity and Relationship Extraction. Use LLMs to identify entities and relationships in your documents (a sketch follows this list):
- People, organizations, products, concepts
- "mentions", "relates to", "part of", "caused by" relationships
- Store in a graph database (Neo4j, Amazon Neptune, or ArangoDB)
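A minimal extraction sketch using an LLM that returns JSON. The prompt wording and model name are ours, and a production version needs few-shot examples and schema validation:

import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract entities and relationships from the text below. "
    "Return JSON with keys 'entities' (objects with name, type) and "
    "'relations' (objects with source, relation, target).\n\nText:\n"
)

def extract_graph(text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT + text}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)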
Phase 2: Community Detection. Group related entities using algorithms like the following (sketched after the list):
- Leiden algorithm (Microsoft's choice) for high-quality communities
- Louvain method for faster processing
- Hierarchical clustering for nested community structures
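For example, networkx ships a Louvain implementation (Leiden needs the separate leidenalg and igraph packages); build_communities is our name:

import networkx as nx

def build_communities(relations):
    # relations: (source, relation, target) triples from the extraction phase
    G = nx.Graph()
    for rel in relations:
        G.add_edge(rel["source"], rel["target"], label=rel["relation"])
    # Returns a list of sets of entity names, one set per community
    return nx.community.louvain_communities(G, seed=42)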
Phase 3: Query Processing. For analytical questions:
- Identify which communities contain relevant information
- Retrieve community summaries instead of individual chunks
- Use summaries to guide deeper retrieval if needed
- Generate answers that synthesize across communities
The Cost Reality: GraphRAG is 2-5x more expensive than traditional RAG due to:
- Entity extraction requiring LLM calls for every document
- Graph storage and maintenance overhead
- More complex query processing
- Community summarization costs
But for analytical use cases, the quality improvement often justifies the cost.
Hybrid Approach: Best of Both Worlds
Smart Implementation Strategy:
- Use traditional RAG for direct, factual questions
- Route analytical queries to GraphRAG automatically
- Implement query classification to choose the right approach
- Cache community summaries to reduce ongoing costs
Query Routing Logic:
def route_query(question):
    # Cheap first pass: keyword heuristics; swap in an LLM classifier as traffic grows
    analytical_keywords = ['trends', 'patterns', 'themes', 'analysis', 'compare', 'relationship']
    if any(keyword in question.lower() for keyword in analytical_keywords):
        return "graphrag"
    else:
        return "traditional_rag"
Progressive Enhancement:
- Start with traditional RAG for 80% of use cases
- Add GraphRAG for specific analytical requirements
- Monitor which approach works better for different query types
- Gradually expand GraphRAG usage based on proven value
The key insight: GraphRAG isn't a replacement for traditional RAG—it's a powerful complement for analytical use cases that require understanding connections and patterns across large document collections.
Ready to put these optimization techniques to work? The performance improvements and advanced patterns covered here will take your RAG system from basic prototype to a production-ready solution that scales with your needs while maintaining quality and user satisfaction.