Optimizing RAG Pipelines for Healthcare: From 2s Latency to 200ms
When we launched Genio Pulse — our medical AI assistant serving 5,000+ doctors — the initial RAG pipeline took 2-3 seconds per query. For a tool that doctors use mid-consultation, that was unacceptable. Over six months of iterative optimization, we brought it down to under 200ms while improving retrieval relevance from 78% to 95%.
This post documents every optimization we made, in the order we implemented them. If you're building RAG for production, these lessons will save you months.
The Starting Point
Our initial architecture was textbook:
User Query → Embed → Vector Search (Qdrant) → Top-K → LLM → Response
Baseline metrics:
- Embedding latency: ~100ms
- Vector search: ~150ms
- LLM generation: ~1500ms
- Total: ~2000ms
- Retrieval relevance (manual eval): 78%
The LLM generation time is mostly out of our control (model provider latency), so we focused on everything else — and discovered that better retrieval actually reduces generation time too, because the LLM gets more relevant context and generates faster, more focused answers.
Optimization 1: Chunking Strategy (78% → 84% relevance)
Our first mistake was using a naive text splitter with fixed chunk sizes. Medical literature has structure — headings, sections, bullet points, references — and splitting mid-paragraph destroys context.
Before:
```python
# Naive: fixed 1000-char chunks with 200-char overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
```

After:
```python
# Semantic chunking that respects document structure
class MedicalDocumentChunker:
    def chunk(self, document: str, metadata: dict) -> list[Document]:
        sections = self.split_by_headers(document)
        chunks = []
        for section in sections:
            if len(section.content) < 512:
                # Small sections stay intact
                chunks.append(Document(
                    content=section.content,
                    metadata={
                        **metadata,
                        "section": section.header,
                        "chunk_type": "full_section",
                    },
                ))
            else:
                # Large sections get sentence-aware splitting
                sub_chunks = self.sentence_split(
                    section.content,
                    target_size=512,
                    overlap_sentences=2,
                )
                for i, sub in enumerate(sub_chunks):
                    chunks.append(Document(
                        content=f"Section: {section.header}\n\n{sub}",
                        metadata={
                            **metadata,
                            "section": section.header,
                            "chunk_index": i,
                            "chunk_type": "sub_section",
                        },
                    ))
        return chunks
```

Key changes:
- Chunk size: 1000 → 512 tokens (smaller chunks = more precise retrieval)
- Overlap: character-based → sentence-based (2 sentences overlap)
- Structure-aware: respect section boundaries
- Prepend section header to each chunk (preserves context)
Impact: Relevance jumped from 78% to 84%. Latency unchanged.
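The `sentence_split` helper referenced above isn't shown in the post; here is a minimal sketch of how sentence-based overlap can work, using a naive regex boundary. The names and sizes are illustrative, not the production implementation:

```python
import re

def sentence_split(text: str, target_size: int = 512,
                   overlap_sentences: int = 2) -> list[str]:
    # Naive boundary detection; a production system would use a proper
    # sentence tokenizer (medical abbreviations break this regex)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        size += len(sentence)
        if size >= target_size:
            chunks.append(" ".join(current))
            # Carry the last N sentences into the next chunk as overlap
            current = current[-overlap_sentences:]
            size = sum(len(s) for s in current)
    if not chunks or len(current) > overlap_sentences:
        chunks.append(" ".join(current))
    return chunks
```

With two-sentence overlap, each chunk opens by repeating the tail of the previous chunk, so a retrieval hit near a boundary still carries its surrounding context.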
Optimization 2: Hybrid Search (84% → 90% relevance)
Pure vector similarity misses exact medical terms. If a doctor searches for "metformin 500mg side effects," pure semantic search might return results about diabetes medications generally. We need exact keyword matching too.
```python
from qdrant_client.models import SparseVector

class HybridSearcher:
    def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        # Dense vector search (semantic understanding)
        dense_results = self.qdrant.search(
            collection_name="medical_docs",
            query_vector=("dense", self.embed(query)),
            limit=limit * 2,  # Fetch more for fusion
        )

        # Sparse vector search (keyword matching via BM25)
        sparse_vector = self.bm25_encoder.encode(query)
        sparse_results = self.qdrant.search(
            collection_name="medical_docs",
            query_vector=("sparse", SparseVector(
                indices=sparse_vector.indices,
                values=sparse_vector.values,
            )),
            limit=limit * 2,
        )

        # Reciprocal Rank Fusion
        return self.rrf_fusion(dense_results, sparse_results, limit=limit)

    def rrf_fusion(self, *result_lists, limit: int, k: int = 60):
        """Combine rankings using Reciprocal Rank Fusion."""
        scores = {}
        results_by_id = {}
        for results in result_lists:
            for rank, result in enumerate(results):
                doc_id = result.id
                results_by_id[doc_id] = result
                scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
        sorted_ids = sorted(scores, key=scores.get, reverse=True)
        return [results_by_id[doc_id] for doc_id in sorted_ids[:limit]]
```

Why RRF over weighted scoring: Reciprocal Rank Fusion doesn't require tuning weights between dense and sparse scores. It's robust and works well out of the box.
Impact: Relevance improved from 84% to 90%. Search latency increased slightly (~20ms) due to dual search, but the quality improvement was worth it.
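To see why RRF needs no weight tuning, here is a toy run with made-up document ids. Each list contributes 1/(k + rank + 1) per document, so a document ranked well in both lists wins:

```python
def rrf(*rankings: list[str], k: int = 60) -> list[str]:
    # Sum reciprocal-rank contributions across all input rankings
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # keyword ranking
print(rrf(dense, sparse))  # doc_a first: it ranks highly in both lists
```

Note that absolute similarity scores never enter the formula, only ranks, which is what makes the fusion robust to differently scaled scoring functions.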
Optimization 3: Metadata Filtering (latency: 150ms → 40ms)
Most queries have implicit constraints that we can filter on before vector search:
```python
from qdrant_client.models import FieldCondition, Filter, MatchAny, Range

def build_filters(self, query: str, doctor_profile: DoctorProfile):
    filters = []

    # Filter by medical specialty
    if doctor_profile.specialty:
        filters.append(
            FieldCondition(
                key="specialty",
                match=MatchAny(any=self.related_specialties(
                    doctor_profile.specialty
                )),
            )
        )

    # Filter by recency (prefer recent medical literature)
    filters.append(
        FieldCondition(
            key="publication_year",
            range=Range(gte=2019),  # Last 5 years
        )
    )

    # Filter by evidence level (prefer high-quality sources)
    filters.append(
        FieldCondition(
            key="evidence_level",
            match=MatchAny(any=["systematic_review", "rct", "meta_analysis"]),
        )
    )

    return Filter(must=filters)
```

Impact: By filtering before vector search, Qdrant searches a much smaller subset. Latency dropped from 150ms to 40ms, and relevance improved to 92% (irrelevant specialties no longer pollute results).
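The effect of pre-filtering is easy to see in miniature: cheap metadata checks shrink the candidate set before any vector math runs. The documents and field values below are invented for illustration; in Qdrant the equivalent conditions are applied during index traversal rather than as a Python loop:

```python
def prefilter(docs: list[dict], specialties: set[str], min_year: int) -> list[dict]:
    # Cheap metadata checks run before any (expensive) vector comparison
    return [
        d for d in docs
        if d["specialty"] in specialties and d["publication_year"] >= min_year
    ]

corpus = [
    {"id": 1, "specialty": "cardiology", "publication_year": 2022},
    {"id": 2, "specialty": "dermatology", "publication_year": 2023},
    {"id": 3, "specialty": "cardiology", "publication_year": 2015},
]

survivors = prefilter(corpus, {"cardiology"}, min_year=2019)
print([d["id"] for d in survivors])  # only document 1 passes both filters
```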
Optimization 4: Reranking (92% → 95% relevance)
Vector similarity is a rough first pass. A cross-encoder reranker scores each query-document pair more accurately:
```python
from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self):
        self.model = CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-12-v2",
            max_length=512,
        )

    def rerank(self, query: str, documents: list, top_k: int = 5):
        pairs = [(query, doc.content) for doc in documents]
        scores = self.model.predict(pairs)
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, score in scored_docs[:top_k]]
```

Pipeline after reranking:
Query → Hybrid Search (top 20) → Rerank (top 5) → LLM
Impact: Relevance went from 92% to 95%. Reranking adds ~30ms but the LLM gets much better context, which actually reduced generation time by ~200ms (shorter, more focused responses).
Optimization 5: Embedding Cache (100ms → 5ms for repeated queries)
Doctors often ask similar questions. We cache embeddings in Redis:
```python
import hashlib
import json

class CachedEmbedder:
    def __init__(self, redis_client, embedder, ttl=3600):
        self.redis = redis_client
        self.embedder = embedder
        self.ttl = ttl

    def embed(self, text: str) -> list[float]:
        cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        embedding = self.embedder.embed_query(text)
        self.redis.setex(cache_key, self.ttl, json.dumps(embedding))
        return embedding
```

Impact: For cache hits (~40% of queries), embedding latency drops from 100ms to 5ms.
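One refinement worth considering (an assumption on our part, not something the post's pipeline necessarily does): normalize the query text before hashing, so trivially different phrasings share a cache entry and the hit rate climbs:

```python
import hashlib

def cache_key(text: str) -> str:
    # Collapse whitespace and lowercase so near-identical queries share one key
    normalized = " ".join(text.lower().split())
    return "emb:" + hashlib.md5(normalized.encode()).hexdigest()

# Different surface forms, same cache entry
assert cache_key("Metformin 500mg  side effects") == cache_key("metformin 500mg side effects")
```

The trade-off is that normalization must be lossless for retrieval purposes; anything more aggressive (stemming, stopword removal) risks returning a cached embedding for a semantically different query.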
Optimization 6: Streaming + Parallel Processing
Instead of sequential execution, we parallelize independent steps:
```python
import asyncio

async def rag_query(query: str, doctor: DoctorProfile):
    # Run embedding and filter building in parallel
    embedding_task = asyncio.create_task(embedder.aembed(query))
    filters = build_filters(query, doctor)
    embedding = await embedding_task

    # Hybrid search (dense + sparse in parallel internally)
    results = await hybrid_searcher.asearch(
        embedding=embedding,
        filters=filters,
        limit=20,
    )

    # Rerank
    top_docs = reranker.rerank(query, results, top_k=5)

    # Stream LLM response (don't wait for full generation)
    async for chunk in llm.astream(
        prompt=build_prompt(query, top_docs),
    ):
        yield chunk
```

Impact: Total perceived latency dropped to ~200ms for first token (streaming means the doctor sees text appearing immediately).
The Final Architecture
```
Query
 ├─ [Parallel] Cache Check → Embed (5-100ms)
 ├─ [Parallel] Build Filters (1ms)
 │
 ├─ Hybrid Search (dense + sparse) with Filters (40ms)
 ├─ Rerank Top 20 → Top 5 (30ms)
 │
 └─ Stream LLM Response (first token: ~150ms)
```
Total to first token: ~200ms
Total to complete response: ~800ms
Monitoring in Production
We track these metrics continuously:
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| P50 latency (to first token) | under 200ms | over 500ms |
| P99 latency (to first token) | under 500ms | over 1500ms |
| Retrieval relevance (weekly eval) | above 93% | below 88% |
| Cache hit rate | above 35% | below 20% |
| Embedding API errors | below 0.1% | above 1% |
| Cost per query | under $0.03 | over $0.10 |
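The P50/P99 numbers in that table can be computed directly from logged per-query totals; a minimal sketch with the standard library (the sample latencies are made up):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    # quantiles(n=100) returns the 99 percentile cut points
    cuts = statistics.quantiles(latencies_ms, n=100)
    return cuts[49], cuts[98]  # P50, P99

sample = [180, 195, 150, 210, 480, 175, 205, 160, 190, 1200]
p50, p99 = latency_percentiles(sample)
```

With only a handful of samples the tail estimate is noisy; in production you would compute this over a rolling window of thousands of query log records.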
```python
from dataclasses import dataclass

# We log every query for continuous evaluation
@dataclass
class QueryLog:
    query: str
    doctor_specialty: str
    retrieved_doc_ids: list[str]
    reranker_scores: list[float]
    embedding_latency_ms: float
    search_latency_ms: float
    rerank_latency_ms: float
    generation_latency_ms: float
    total_latency_ms: float
    cache_hit: bool
    token_usage: dict
```

Lessons Learned
- Chunking quality > embedding model quality. We got a bigger relevance boost from better chunking than from upgrading embedding models.
- Filter before you search. Metadata filtering is free performance. Use every signal you have (specialty, recency, document type) to narrow the search space.
- Reranking is almost always worth it. The 30ms cost pays for itself by giving the LLM better context (faster, more accurate responses).
- Cache everything you can. Embeddings, search results, even full responses for common queries. In healthcare, many questions are variations of the same few hundred queries.
- Measure retrieval quality separately from generation quality. A bad retrieval + good LLM still produces a bad answer. Track retrieval relevance independently.
- Stream, don't wait. Perceived latency matters more than total latency. First-token-time is the metric that determines user satisfaction.
These optimizations didn't happen overnight — they were spread across six months of iterative improvement, driven by user feedback and continuous monitoring. The key is having good observability so you know where to focus next.