RAG Performance Optimization and Advanced Patterns: GraphRAG and Hybrid Search

by Aaron Dsilva, Founding Engineer

Making Your RAG System Actually Fast (And Smart)

The Hybrid Search Approach That Actually Works

Here's what we learned the hard way: you need both semantic search AND keyword search working together.

Think of it like this:

  • Vector search is like having a really smart intern who "gets" what you mean
  • Keyword search is like having a detail-oriented colleague who never misses the fine print
  • Together they're unstoppable

In practice, this means combining:

  • Dense vectors for questions like "How do I cancel my subscription?" (semantic understanding)
  • BM25 keyword matching for queries like "What's the API rate limit for the /users endpoint?" (exact technical terms)

The Reranking Magic

Once you've got results from both approaches, you need to merge them intelligently. The most robust approach we've found uses Reciprocal Rank Fusion (RRF):

Combined_Score = 1/(rank_vector + k) + 1/(rank_keyword + k)

Where k is usually set to 60. Simple, effective, and doesn't require training additional models.
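
To make the formula concrete, here's a minimal Python sketch of RRF over two ranked lists of document IDs (the inputs are illustrative):

def rrf_merge(vector_results, keyword_results, k=60):
    """Merge two ranked lists of doc IDs (best first) with Reciprocal Rank Fusion."""
    scores = {}
    for ranked_list in (vector_results, keyword_results):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it comes out on top
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'd', 'c']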

Alternative Reranking Approaches:

  • Cross-encoder models like bge-reranker (more accurate but slower)
  • Cohere's rerank API (excellent quality, costs money)
  • Custom reranking models trained on your specific domain (highest accuracy, most effort)

The Need for Speed

Performance Optimization Stack

Layer 1: Caching (80% of the wins)
Layer 2: Async processing (15% of the wins)
Layer 3: Hardware optimization (5% of the wins)

Focus on caching first—it's where you'll get the biggest performance improvements for the least effort.

The Smart Caching Strategy:

Query Result Caching:

  • Cache identical questions for 24 hours
  • Use fuzzy matching for "similar enough" queries
  • Store both the retrieved chunks AND the final answer

Embedding Caching:

  • Never re-embed the same document
  • Cache embeddings with document hashes
  • Invalidate only when content actually changes

Chunk Preprocessing:

  • Cache parsed and chunked documents
  • Store metadata with the chunks
  • Update incrementally, not from scratch (see the caching sketch below)
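
A minimal sketch of the document-hash pattern above, using an in-memory dict where production systems would use Redis or similar; embed_fn stands in for whatever embedding call you already make:

import hashlib

embedding_cache = {}  # stand-in for Redis or another persistent cache

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_embedding(text, embed_fn):
    """Return a cached embedding; recompute only when the content hash changes."""
    key = content_hash(text)
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(text)  # the expensive model call
    return embedding_cache[key]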

Advanced Performance Patterns

Streaming Responses: Don't wait for the entire answer to be generated. Stream the response back to users as it's being created. This makes your system feel 2-3x faster even if the actual processing time is the same.
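
As an illustration, here's a minimal streaming sketch assuming OpenAI's v1 Python SDK; any provider with a streaming endpoint follows the same shape:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def stream_answer(prompt):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yields tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # hand each token to your UI instead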

Prefetch Common Queries: Analyze your query logs and pre-compute answers for the most common questions. Store these in a simple key-value cache for instant responses.
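
A sketch of the prefetch idea, assuming you have a query log and an answer_fn that runs your full RAG pipeline (both hypothetical names):

from collections import Counter

def build_prefetch_cache(query_log, answer_fn, top_n=100):
    """Pre-compute answers for the most frequent logged queries."""
    common = Counter(q.strip().lower() for q in query_log).most_common(top_n)
    return {query: answer_fn(query) for query, _count in common}

# At request time: instant hit for common questions, full pipeline otherwise
# answer = prefetch_cache.get(question.strip().lower()) or run_rag_pipeline(question)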

Smart Batch Processing: If you're processing multiple questions at once, batch them through your LLM calls. Most APIs support batching, which dramatically reduces overhead.
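
One common way to get this effect is firing the calls concurrently rather than using a literal batch endpoint; here's a sketch using OpenAI's async client (model name is illustrative):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def answer_batch(questions):
    """Fire all LLM calls concurrently instead of one at a time."""
    async def ask(question):
        response = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    return await asyncio.gather(*(ask(q) for q in questions))

# answers = asyncio.run(answer_batch(["q1", "q2", "q3"]))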


Monitoring & Evaluation: Catching Problems Before Users Do

The Metrics That Actually Matter

Response Quality Metrics:

  • Answer Accuracy - Are we giving correct information?
  • Relevance Score - Are the retrieved documents actually related?
  • Completeness - Are we answering the full question?
  • Citation Quality - Can users trace answers back to sources?

System Performance Metrics:

  • Query Latency (P95) - How fast is the slowest 5% of queries?
  • Retrieval Precision - What percentage of retrieved docs are useful?
  • Cache Hit Rate - How often are we avoiding expensive recomputation?
  • Error Rate - How often does something break?

User Experience Metrics:

  • User Satisfaction - Thumbs up/down feedback
  • Query Abandonment - Do users give up waiting?
  • Follow-up Questions - Are users getting what they need?
  • Session Duration - Are users finding value?

The RAGAS Evaluation Framework

RAGAS (Retrieval-Augmented Generation Assessment) is the closest thing we have to a standard for evaluating RAG systems. Here's what it measures:

  • Faithfulness: Does the answer contradict the retrieved documents?
  • Answer Relevancy: Does the answer actually address the question?
  • Context Precision: Are the top-ranked retrieved docs relevant?
  • Context Recall: Did we retrieve all the relevant information available?

RAGAS Implementation:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Your evaluation dataset: 'contexts' is a list of lists (the retrieved
# chunks for each question). Recent ragas versions expect the reference
# answers under 'ground_truth' (singular); older ones used 'ground_truths'.
dataset = Dataset.from_dict({
    'question': [...],
    'answer': [...],
    'contexts': [...],
    'ground_truth': [...]
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

Setting Up Continuous Evaluation:

  1. Sample 1-5% of production queries for evaluation
  2. Mix automatic RAGAS scoring with human review
  3. Set up alerts when scores drop below thresholds (see the sketch after this list)
  4. Track trends to catch gradual degradation
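
A minimal sampling-and-alerting sketch; run_ragas and alert are hypothetical wrappers around ragas.evaluate and your paging system:

import random

SAMPLE_RATE = 0.02          # evaluate roughly 2% of production traffic
FAITHFULNESS_FLOOR = 0.85   # tune this threshold against your own baseline

def maybe_evaluate(question, answer, contexts):
    if random.random() > SAMPLE_RATE:
        return
    scores = run_ragas(question, answer, contexts)  # hypothetical ragas wrapper
    if scores["faithfulness"] < FAITHFULNESS_FLOOR:
        alert(f"Faithfulness dropped to {scores['faithfulness']:.2f}")  # hypothetical pager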

Case Study: Enterprise SaaS Platform

Production RAG Monitoring Success

Challenge

Needed continuous monitoring of RAG quality across 10K+ daily queries without manual review overhead

Solution

Implemented RAGAS automated evaluation with human review sampling and trend monitoring

Results

Reduced manual review effort by 85% while improving response quality through early detection of performance degradation

Key Metrics

  • Manual Review Reduction: 85%
  • Quality Score Improvement: 23%
  • Issue Detection Time: < 24 hours
  • Implementation Time: 2 weeks

GraphRAG: When Traditional RAG Hits a Wall

The Problem Traditional RAG Can't Solve

Traditional RAG works great for direct questions: "What's our refund policy?" or "How do I reset my password?"

But it struggles with questions like:

  • "What are the common themes across all customer complaints this quarter?"
  • "How do different product features relate to each other?"
  • "What patterns emerge from our sales data across regions?"

Why? Because these questions require connecting information across multiple documents and identifying patterns that span your entire corpus.

How GraphRAG Changes the Game

Microsoft's GraphRAG doesn't just retrieve relevant chunks—it builds a knowledge graph from your documents, then uses that graph to answer complex analytical questions.

The GraphRAG Process:

  1. Entity Extraction: Identify people, places, concepts, and relationships
  2. Community Detection: Group related entities into logical clusters
  3. Community Summarization: Create summaries of each cluster
  4. Query-time Graph Traversal: Use the graph structure to find connected information

Microsoft's Numbers: Research shows GraphRAG has a 70-80% "win rate" vs traditional RAG for complex analytical questions. The trade-off? It's more expensive and complex to implement.

When to Use GraphRAG

Good Candidates:

  • Research and analysis use cases
  • Legal document review requiring pattern identification
  • Business intelligence queries across large document sets
  • Investigative journalism connecting disparate sources
  • Academic research synthesis and discovery

Skip GraphRAG For:

  • Simple Q&A use cases
  • Technical documentation lookup
  • Customer support with direct answers
  • Cost-sensitive implementations
  • Small document sets (<1000 documents)

Building Knowledge Connections

Implementation Strategy:

Phase 1: Entity and Relationship Extraction

Use LLMs to identify entities and relationships in your documents (a minimal sketch follows this list):

  • People, organizations, products, concepts
  • "mentions", "relates to", "part of", "caused by" relationships
  • Store in a graph database (Neo4j, Amazon Neptune, or ArangoDB)
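
A rough sketch of the extraction step, assuming OpenAI's JSON mode; the prompt and schema are illustrative, not GraphRAG's actual prompts:

import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Extract entities and relationships from the text below.
Return JSON like:
{{"entities": [{{"name": "...", "type": "..."}}],
  "relations": [{{"source": "...", "relation": "...", "target": "..."}}]}}

Text: {text}"""

def extract_graph(text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=text)}],
        response_format={"type": "json_object"},  # ask the model for strict JSON
    )
    return json.loads(response.choices[0].message.content)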

Phase 2: Community Detection

Group related entities using algorithms like the following (a small sketch comes after the list):

  • Leiden algorithm (Microsoft's choice) for high-quality communities
  • Louvain method for faster processing
  • Hierarchical clustering for nested community structures
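
A small sketch using NetworkX's built-in Louvain implementation over the Phase 1 triples; swap in the leidenalg package if you want Microsoft's Leiden approach:

import networkx as nx

def detect_communities(relations):
    """relations: (source, relation, target) triples from Phase 1."""
    graph = nx.Graph()
    for source, _relation, target in relations:
        graph.add_edge(source, target)
    # Louvain ships with NetworkX >= 3.0; Microsoft uses Leiden, which is
    # available via the separate leidenalg package
    return nx.community.louvain_communities(graph, seed=42)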

Phase 3: Query Processing

For analytical questions (sketched after these steps):

  1. Identify which communities contain relevant information
  2. Retrieve community summaries instead of individual chunks
  3. Use summaries to guide deeper retrieval if needed
  4. Generate answers that synthesize across communities
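
Steps 1-3 can be sketched as summary-level retrieval; embed_fn and llm_fn are hypothetical stand-ins for your embedding and generation calls:

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_analytical_query(question, communities, embed_fn, llm_fn, top_k=3):
    """communities: dicts holding a 'summary' string and its 'embedding'."""
    query_vec = embed_fn(question)
    ranked = sorted(communities,
                    key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    context = "\n\n".join(c["summary"] for c in ranked[:top_k])
    prompt = f"Using these community summaries:\n{context}\n\nAnswer: {question}"
    return llm_fn(prompt)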

The Cost Reality: GraphRAG is 2-5x more expensive than traditional RAG due to:

  • Entity extraction requiring LLM calls for every document
  • Graph storage and maintenance overhead
  • More complex query processing
  • Community summarization costs

But for analytical use cases, the quality improvement often justifies the cost.

Hybrid Approach: Best of Both Worlds

Smart Implementation Strategy:

  1. Use traditional RAG for direct, factual questions
  2. Route analytical queries to GraphRAG automatically
  3. Implement query classification to choose the right approach
  4. Cache community summaries to reduce ongoing costs

Query Routing Logic:

def route_query(question):
    """Crude keyword heuristic: send corpus-wide questions to GraphRAG."""
    analytical_keywords = ['trends', 'patterns', 'themes', 'analysis', 'compare', 'relationship']

    # Analytical phrasing usually means the answer spans many documents
    if any(keyword in question.lower() for keyword in analytical_keywords):
        return "graphrag"
    return "traditional_rag"

Progressive Enhancement:

  • Start with traditional RAG for 80% of use cases
  • Add GraphRAG for specific analytical requirements
  • Monitor which approach works better for different query types
  • Gradually expand GraphRAG usage based on proven value

The key insight: GraphRAG isn't a replacement for traditional RAG—it's a powerful complement for analytical use cases that require understanding connections and patterns across large document collections.

Ready to implement these optimization techniques in your RAG system? The performance improvements and advanced patterns you've learned here will transform your RAG system from a basic prototype into a production-ready solution that scales with your needs while maintaining quality and user satisfaction.
