November 28, 2024 · 7 min read

Optimizing RAG Pipelines for Healthcare: From 2s Latency to 200ms

RAG · Healthcare AI · Vector Databases · Qdrant · Performance · LangChain

When we launched Genio Pulse — our medical AI assistant serving 5,000+ doctors — the initial RAG pipeline took 2-3 seconds per query. For a tool that doctors use mid-consultation, that was unacceptable. Over six months of iterative optimization, we brought it down to under 200ms while improving retrieval relevance from 78% to 95%.

This post documents every optimization we made, in the order we implemented them. If you're building RAG for production, these lessons will save you months.

The Starting Point

Our initial architecture was textbook:

User Query → Embed → Vector Search (Qdrant) → Top-K → LLM → Response

Baseline metrics:

  • Embedding latency: ~100ms
  • Vector search: ~150ms
  • LLM generation: ~1500ms
  • Total: ~2000ms
  • Retrieval relevance (manual eval): 78%

The LLM generation time is mostly out of our control (model provider latency), so we focused on everything else. Along the way we discovered that better retrieval also reduces generation time: with more relevant context, the LLM produces shorter, more focused answers.

Optimization 1: Chunking Strategy (78% → 84% relevance)

Our first mistake was using a naive text splitter with fixed chunk sizes. Medical literature has structure — headings, sections, bullet points, references — and splitting mid-paragraph destroys context.

Before:

# Naive: fixed 1000-char chunks with 200-char overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

After:

# Semantic chunking that respects document structure
class MedicalDocumentChunker:
    def chunk(self, document: str, metadata: dict) -> list[Document]:
        sections = self.split_by_headers(document)
        chunks = []
 
        for section in sections:
            if len(section.content) < 512:
                # Small sections stay intact
                chunks.append(Document(
                    content=section.content,
                    metadata={
                        **metadata,
                        "section": section.header,
                        "chunk_type": "full_section",
                    }
                ))
            else:
                # Large sections get sentence-aware splitting
                sub_chunks = self.sentence_split(
                    section.content,
                    target_size=512,
                    overlap_sentences=2,
                )
                for i, sub in enumerate(sub_chunks):
                    chunks.append(Document(
                        content=f"Section: {section.header}\n\n{sub}",
                        metadata={
                            **metadata,
                            "section": section.header,
                            "chunk_index": i,
                            "chunk_type": "sub_section",
                        }
                    ))
 
        return chunks

Key changes:

  • Chunk size: ~1000 characters → 512 tokens (smaller chunks = more precise retrieval)
  • Overlap: character-based → sentence-based (2 sentences overlap)
  • Structure-aware: respect section boundaries
  • Prepend section header to each chunk (preserves context)

Impact: Relevance jumped from 78% to 84%. Latency unchanged.
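
The `sentence_split` helper does the heavy lifting in the chunker above. A minimal sketch of one way to implement it (our own greedy version, counting characters rather than true tokens; the regex boundary detection is a simplification — production code would use a real sentence tokenizer such as spaCy's):

```python
import re

def sentence_split(text: str, target_size: int = 512,
                   overlap_sentences: int = 2) -> list[str]:
    """Greedy sentence-aware splitter: pack sentences until target_size,
    then start the next chunk with the last `overlap_sentences` sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sent in sentences:
        if size + len(sent) > target_size and current:
            chunks.append(" ".join(current))
            # Carry the last N sentences forward as overlap
            current = current[-overlap_sentences:]
            size = sum(len(s) for s in current)
        current.append(sent)
        size += len(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks never break mid-sentence, a retrieved chunk always reads as coherent prose to the LLM.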

Optimization 2: Hybrid Search (84% → 90% relevance)

Pure vector similarity misses exact medical terms. If a doctor searches for "metformin 500mg side effects," pure semantic search might return results about diabetes medications generally. We need exact keyword matching too.

from qdrant_client.models import SparseVector
 
class HybridSearcher:
    def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        # Dense vector search (semantic understanding)
        dense_results = self.qdrant.search(
            collection_name="medical_docs",
            query_vector=("dense", self.embed(query)),
            limit=limit * 2,  # Fetch more for fusion
        )
 
        # Sparse vector search (keyword matching via BM25)
        sparse_vector = self.bm25_encoder.encode(query)
        sparse_results = self.qdrant.search(
            collection_name="medical_docs",
            query_vector=("sparse", SparseVector(
                indices=sparse_vector.indices,
                values=sparse_vector.values,
            )),
            limit=limit * 2,
        )
 
        # Reciprocal Rank Fusion
        return self.rrf_fusion(dense_results, sparse_results, limit=limit)
 
    def rrf_fusion(self, *result_lists, limit: int, k: int = 60):
        """Combine rankings using Reciprocal Rank Fusion."""
        scores = {}
        by_id = {}  # map IDs back to full results
        for results in result_lists:
            for rank, result in enumerate(results):
                by_id[result.id] = result
                scores[result.id] = scores.get(result.id, 0) + 1.0 / (k + rank + 1)

        ranked = sorted(scores, key=scores.get, reverse=True)
        return [by_id[doc_id] for doc_id in ranked[:limit]]

Why RRF over weighted scoring: Reciprocal Rank Fusion doesn't require tuning weights between dense and sparse scores. It's robust and works well out of the box.
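
To make the fusion concrete, here is a worked toy example (the `rrf_scores` helper mirrors the `rrf_fusion` method above, operating on bare document IDs):

```python
def rrf_scores(result_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """RRF over ranked ID lists: each list contributes 1/(k + rank + 1)
    to every document it contains."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return scores

dense = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 ranking
scores = rrf_scores([dense, sparse])
# doc_b ranks near the top of BOTH lists (1/62 + 1/61), so it edges out
# doc_a (1/61 + 1/63) even though doc_a won the dense list outright.
```

This is the property that makes RRF tuning-free: documents that both retrievers agree on rise to the top, without ever comparing dense and sparse scores directly.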

Impact: Relevance improved from 84% to 90%. Search latency increased slightly (~20ms) due to dual search, but the quality improvement was worth it.

Optimization 3: Metadata Filtering (latency: 150ms → 40ms)

Most queries have implicit constraints that we can filter on before vector search:

from qdrant_client.models import FieldCondition, Filter, MatchAny, Range

def build_filters(self, query: str, doctor_profile: DoctorProfile):
    filters = []
 
    # Filter by medical specialty
    if doctor_profile.specialty:
        filters.append(
            FieldCondition(
                key="specialty",
                match=MatchAny(any=self.related_specialties(
                    doctor_profile.specialty
                )),
            )
        )
 
    # Filter by recency (prefer recent medical literature)
    filters.append(
        FieldCondition(
            key="publication_year",
            range=Range(gte=2019),  # Last 5 years
        )
    )
 
    # Filter by evidence level (prefer high-quality sources)
    filters.append(
        FieldCondition(
            key="evidence_level",
            match=MatchAny(any=["systematic_review", "rct", "meta_analysis"]),
        )
    )
 
    return Filter(must=filters)

Impact: By filtering before vector search, Qdrant searches a much smaller subset. Latency dropped from 150ms to 40ms, and relevance improved to 92% (irrelevant specialties no longer pollute results).
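
The effect is easy to demonstrate without a running Qdrant instance. This self-contained sketch (toy `Doc` records and a toy distance score, all our own) applies the cheap metadata predicates first, so the expensive similarity computation only runs over the survivors:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    specialty: str
    publication_year: int
    vector: list[float]

def prefiltered_search(docs: list[Doc], query_vec: list[float],
                       specialties: set[str], min_year: int,
                       limit: int = 5) -> list[Doc]:
    # Metadata filter first: shrink the candidate set before scoring
    candidates = [d for d in docs
                  if d.specialty in specialties
                  and d.publication_year >= min_year]
    # Toy similarity: negative squared distance (higher = closer)
    def score(d: Doc) -> float:
        return -sum((a - b) ** 2 for a, b in zip(d.vector, query_vec))
    return sorted(candidates, key=score, reverse=True)[:limit]
```

Qdrant does this natively via the `query_filter` argument to `search`, using the `Filter` built above, with the same payoff: fewer vectors scored per query.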

Optimization 4: Reranking (92% → 95% relevance)

Vector similarity is a rough first pass. A cross-encoder reranker scores each query-document pair more accurately:

from sentence_transformers import CrossEncoder
 
class Reranker:
    def __init__(self):
        self.model = CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-12-v2",
            max_length=512,
        )
 
    def rerank(self, query: str, documents: list, top_k: int = 5):
        pairs = [(query, doc.content) for doc in documents]
        scores = self.model.predict(pairs)
 
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)
 
        return [doc for doc, score in scored_docs[:top_k]]

Pipeline after reranking:

Query → Hybrid Search (top 20) → Rerank (top 5) → LLM

Impact: Relevance went from 92% to 95%. Reranking adds ~30ms but the LLM gets much better context, which actually reduced generation time by ~200ms (shorter, more focused responses).

Optimization 5: Embedding Cache (100ms → 5ms for repeated queries)

Doctors often ask similar questions. We cache embeddings in Redis:

import hashlib
import json

class CachedEmbedder:
    def __init__(self, redis_client, embedder, ttl=3600):
        self.redis = redis_client
        self.embedder = embedder
        self.ttl = ttl
 
    def embed(self, text: str) -> list[float]:
        cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
        cached = self.redis.get(cache_key)
 
        if cached:
            return json.loads(cached)
 
        embedding = self.embedder.embed_query(text)
        self.redis.setex(cache_key, self.ttl, json.dumps(embedding))
        return embedding

Impact: For cache hits (~40% of queries), embedding latency drops from 100ms to 5ms.
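
One refinement worth considering on top of the class above (our suggestion, not part of the original implementation): normalize the query before hashing, so trivially different phrasings share a cache entry and the hit rate climbs:

```python
import hashlib

def cache_key(text: str) -> str:
    # Collapse whitespace and case before hashing, so
    # "Metformin  500mg" and "metformin 500mg" hit the same entry.
    normalized = " ".join(text.lower().split())
    return "emb:" + hashlib.md5(normalized.encode()).hexdigest()
```

Note that normalization must happen before embedding too, or the cached vector won't match what the un-normalized query would have produced.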

Optimization 6: Streaming + Parallel Processing

Instead of sequential execution, we parallelize independent steps:

import asyncio
 
async def rag_query(query: str, doctor: DoctorProfile):
    # Run embedding and filter building in parallel
    embedding_task = asyncio.create_task(embedder.aembed(query))
    filters = build_filters(query, doctor)
 
    embedding = await embedding_task
 
    # Hybrid search (dense + sparse in parallel internally)
    results = await hybrid_searcher.asearch(
        embedding=embedding,
        filters=filters,
        limit=20,
    )
 
    # Rerank
    top_docs = reranker.rerank(query, results, top_k=5)
 
    # Stream LLM response (don't wait for full generation)
    async for chunk in llm.astream(
        prompt=build_prompt(query, top_docs),
    ):
        yield chunk

Impact: Total perceived latency dropped to ~200ms for first token (streaming means the doctor sees text appearing immediately).

The Final Architecture

Query
  ├─ [Parallel] Cache Check → Embed (5-100ms)
  ├─ [Parallel] Build Filters (1ms)
  │
  ├─ Hybrid Search (dense + sparse) with Filters (40ms)
  ├─ Rerank Top 20 → Top 5 (30ms)
  │
  └─ Stream LLM Response (first token: ~150ms)

Total to first token: ~200ms
Total to complete response: ~800ms

Monitoring in Production

We track these metrics continuously:

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| P50 latency (to first token) | under 200ms | over 500ms |
| P99 latency (to first token) | under 500ms | over 1500ms |
| Retrieval relevance (weekly eval) | above 93% | below 88% |
| Cache hit rate | above 35% | below 20% |
| Embedding API errors | below 0.1% | above 1% |
| Cost per query | under $0.03 | over $0.10 |

# We log every query for continuous evaluation
from dataclasses import dataclass

@dataclass
class QueryLog:
    query: str
    doctor_specialty: str
    retrieved_doc_ids: list[str]
    reranker_scores: list[float]
    embedding_latency_ms: float
    search_latency_ms: float
    rerank_latency_ms: float
    generation_latency_ms: float
    total_latency_ms: float
    cache_hit: bool
    token_usage: dict
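
Turning these logs into the P50/P99 numbers above is straightforward; here is a small nearest-rank percentile helper (our own, not from our codebase):

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: pct=50 for P50, pct=99 for P99."""
    if not values:
        raise ValueError("percentile of empty list")
    ranked = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [180, 190, 210, 250, 1400]  # total_latency_ms samples
p50 = percentile(latencies_ms, 50)  # 210
```

Running this over a day's `QueryLog` rows, grouped per stage (`embedding_latency_ms`, `search_latency_ms`, …), shows exactly which stage a regression came from.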

Lessons Learned

  1. Chunking quality > embedding model quality. We got a bigger relevance boost from better chunking than from upgrading embedding models.

  2. Filter before you search. Metadata filtering is free performance. Use every signal you have (specialty, recency, document type) to narrow the search space.

  3. Reranking is almost always worth it. The 30ms cost pays for itself by giving the LLM better context (faster, more accurate responses).

  4. Cache everything you can. Embeddings, search results, even full responses for common queries. In healthcare, many questions are variations of the same few hundred queries.

  5. Measure retrieval quality separately from generation quality. A bad retrieval + good LLM still produces a bad answer. Track retrieval relevance independently.

  6. Stream, don't wait. Perceived latency matters more than total latency. First-token-time is the metric that determines user satisfaction.


These optimizations didn't happen overnight — they were spread across six months of iterative improvement, driven by user feedback and continuous monitoring. The key is having good observability so you know where to focus next.