Multi-Agent Systems in Production: Patterns That Actually Work
After building and deploying multi-agent systems that handle thousands of requests daily in healthcare and customer support, I've developed strong opinions about what works and what doesn't. This post covers the patterns that survived contact with production reality.
The Problem with Single-Agent Architectures
Most tutorials show you a single LLM agent with a bag of tools. This works great for demos. In production, it falls apart because:
- Context window pollution: One agent handling everything means every tool description, every piece of context, competes for attention
- No separation of concerns: A billing question and a technical support question require completely different context, tools, and guardrails
- Debugging nightmares: When something goes wrong, you're searching through one massive chain of thought
- Inconsistent quality: An agent that's great at classification might be terrible at generation, and vice versa
Pattern 1: The Router-Specialist Architecture
This is the pattern I use most. A lightweight router agent classifies the request and delegates to specialized agents.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class OrchestratorState(TypedDict):
    messages: list
    classification: str
    specialist_response: str
    confidence: float
    iteration: int

def router(state: OrchestratorState) -> OrchestratorState:
    """Lightweight classifier: fast model, minimal prompt."""
    classification = classify_with_gpt4_mini(
        state["messages"][-1],
        categories=["medical_query", "billing", "scheduling", "general"],
    )
    return {
        **state,
        "classification": classification.category,
        "confidence": classification.confidence,
    }

def route_to_specialist(state: OrchestratorState) -> str:
    """Deterministic routing based on classification."""
    if state["confidence"] < 0.7:
        return "clarification_agent"
    return f"{state['classification']}_agent"

graph = StateGraph(OrchestratorState)
graph.add_node("router", router)
graph.add_node("medical_query_agent", medical_agent)
graph.add_node("billing_agent", billing_agent)
graph.add_node("scheduling_agent", scheduling_agent)
graph.add_node("general_agent", general_agent)
graph.add_node("clarification_agent", clarification_agent)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route_to_specialist)
```

Why this works:
- The router uses a fast, cheap model (GPT-4 mini or Claude Haiku)
- Each specialist has its own tailored prompt, tools, and guardrails
- Adding a new category means adding one specialist — no changes to existing agents
- Classification confidence below threshold triggers clarification instead of guessing
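The `classify_with_gpt4_mini` helper is glue code I left out above. The model call itself is a routine API request; the part worth getting right is handling the reply defensively. Here is a minimal sketch of the parsing half, assuming the classifier is prompted to return JSON with `category` and `confidence` fields (the `Classification` shape and the fallback behavior are illustrative choices, not a fixed contract):

```python
import json
from dataclasses import dataclass

@dataclass
class Classification:
    category: str
    confidence: float

def parse_classification(raw: str, categories: list[str]) -> Classification:
    """Parse the model's JSON reply; fall back to a low-confidence
    'general' label so the router never crashes on malformed output."""
    try:
        data = json.loads(raw)
        category = data["category"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return Classification("general", 0.0)
    if category not in categories:
        return Classification("general", 0.0)
    # Clamp to [0, 1]: models occasionally return confidences outside range
    return Classification(category, min(max(confidence, 0.0), 1.0))
```

The fallback matters: a confidence of 0.0 on any parse failure means the low-confidence branch routes to the clarification agent instead of guessing.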
Pattern 2: The Pipeline Architecture
For tasks with sequential stages, I use a pipeline where each agent's output feeds into the next.
[Document Ingestion] → [Entity Extraction] → [Analysis] → [Report Generation]
Real-world example: Medical Report Analysis
In our healthcare platform, when a doctor uploads a patient report:
- Extraction Agent: Pulls structured data (lab values, medications, diagnoses) from unstructured text
- Validation Agent: Cross-references extracted data against medical ontologies (ICD-10, SNOMED)
- Analysis Agent: Identifies patterns, flags abnormalities, suggests differential diagnoses
- Report Agent: Generates a structured clinical summary for the doctor
```python
def extraction_node(state):
    """Extract structured medical entities from raw text."""
    extracted = medical_ner_chain.invoke({
        "report": state["raw_report"],
        "extraction_schema": MEDICAL_ENTITY_SCHEMA,
    })
    return {**state, "entities": extracted, "stage": "extraction_complete"}

def validation_node(state):
    """Validate extracted entities against medical ontologies."""
    validated = []
    for entity in state["entities"]:
        is_valid = validate_against_ontology(entity)
        validated.append({**entity, "validated": is_valid})
    return {**state, "entities": validated, "stage": "validation_complete"}
```

Key insight: Each agent in the pipeline can use a different model. The extraction agent might use a fine-tuned model, while the analysis agent uses GPT-4 with a carefully crafted prompt.
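In LangGraph the wiring is a chain of `add_edge` calls, but conceptually the pipeline is just a fold of state through the stages in order. A runnable sketch with toy stand-in stages (the stage bodies here are placeholders, not the real chains):

```python
from functools import reduce

def run_pipeline(state: dict, stages) -> dict:
    """Fold the state dict through each stage in order."""
    return reduce(lambda s, stage: stage(s), stages, state)

# Toy stages standing in for the real extraction/validation nodes.
def fake_extract(state):
    return {**state, "entities": [{"term": "metformin"}], "stage": "extraction_complete"}

def fake_validate(state):
    return {
        **state,
        "entities": [{**e, "validated": True} for e in state["entities"]],
        "stage": "validation_complete",
    }

result = run_pipeline({"raw_report": "..."}, [fake_extract, fake_validate])
```

Because every stage takes and returns the same state shape, you can unit-test each one in isolation and reorder or insert stages without touching the others.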
Pattern 3: The Debate Architecture
For high-stakes decisions, I have multiple agents independently analyze the same input; a synthesis agent then resolves any disagreements.
This pattern is critical in our medical AI system where diagnostic accuracy matters:
```python
def parallel_analysis(state):
    """Run multiple diagnostic agents on the same input."""
    analyses = []
    # Sequential here for clarity; in production these calls run concurrently.
    for agent in [conservative_agent, aggressive_agent, specialist_agent]:
        result = agent.invoke(state["patient_data"])
        analyses.append(result)
    return {**state, "analyses": analyses}

def synthesis(state):
    """Resolve disagreements between agents."""
    analyses = state["analyses"]
    # If all agents agree, high confidence
    if all_agree(analyses):
        return {**state, "final_diagnosis": analyses[0], "confidence": 0.95}
    # If disagreement, present options with reasoning
    return {
        **state,
        "final_diagnosis": synthesize_with_reasoning(analyses),
        "confidence": calculate_agreement_score(analyses),
        "requires_human_review": True,
    }
```

When to use this: Only for decisions where being wrong has serious consequences. It's expensive (3x the LLM calls) but catches errors that single-agent systems miss.
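`all_agree` and `calculate_agreement_score` are simple once you decide what "agreement" means. Here is one plausible definition, assuming each analysis reduces to a single diagnosis label (real outputs need a normalization step first, which I'm skipping):

```python
from collections import Counter

def all_agree(analyses: list[str]) -> bool:
    """True when every agent returned the same diagnosis label."""
    return len(set(analyses)) == 1

def calculate_agreement_score(analyses: list[str]) -> float:
    """Fraction of agents backing the most common diagnosis."""
    if not analyses:
        return 0.0
    _, count = Counter(analyses).most_common(1)[0]
    return count / len(analyses)
```

With three agents the score can only be 1/3, 2/3, or 1.0, which maps cleanly onto "review required", "present with caveats", and "high confidence".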
Pattern 4: Human-in-the-Loop Escalation
Every production multi-agent system needs a graceful path to human intervention. Here's how I implement it:
```python
ESCALATION_TRIGGERS = {
    "low_confidence": lambda state: state["confidence"] < 0.6,
    "sensitive_topic": lambda state: any(
        t in state["classification"] for t in ["legal", "emergency", "complaint"]
    ),
    "repeated_clarification": lambda state: state["clarification_count"] > 2,
    "user_request": lambda state: "speak to human" in state["messages"][-1].lower(),
}

def should_escalate(state) -> bool:
    return any(check(state) for check in ESCALATION_TRIGGERS.values())
```

The escalation should include:
- Full conversation history
- Agent classification and confidence scores
- Attempted solutions and why they failed
- Suggested actions for the human operator
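Packaging those four items is mechanical; the point is to do it in one place so no handoff ever arrives without context. A sketch (the state keys here are illustrative, match them to your own schema):

```python
def build_escalation_payload(state: dict) -> dict:
    """Bundle everything a human operator needs into one handoff object."""
    return {
        "conversation": state["messages"],
        "classification": state.get("classification"),
        "confidence": state.get("confidence"),
        "attempted_solutions": state.get("attempts", []),
        "suggested_actions": state.get("suggestions", []),
    }
```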
Failure Modes I've Encountered
1. The Infinite Loop
An agent that keeps calling tools without making progress. Solution: Hard iteration limits with graceful degradation.
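The guard itself is a few lines; the cap of 8 below is arbitrary and the degraded response is a placeholder, but the shape is what matters: past the cap, return something useful and flag for review instead of making another tool call.

```python
MAX_ITERATIONS = 8  # arbitrary cap; tune per workload

def step_or_degrade(state: dict) -> dict:
    """Advance the loop, or degrade gracefully once the cap is hit."""
    if state["iteration"] >= MAX_ITERATIONS:
        return {
            **state,
            "final_response": "I couldn't complete this automatically.",
            "requires_human_review": True,
        }
    return {**state, "iteration": state["iteration"] + 1}
```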
2. The Confident Wrong Answer
An agent that produces a plausible-sounding but incorrect response with high confidence. Solution: The debate pattern, plus always showing confidence scores to end users.
3. The Context Leak
Information from one conversation leaking into another due to shared state. Solution: Strict state isolation per conversation, no global mutable state.
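The fix is easy to state and easy to forget: every conversation gets its own state object, never a shared module-level one. A minimal sketch of the pattern:

```python
import copy

# Template only; never mutated directly.
INITIAL_STATE = {"messages": [], "classification": "", "confidence": 0.0, "iteration": 0}

def new_conversation_state() -> dict:
    """Deep-copy the template so conversations never share mutable objects."""
    return copy.deepcopy(INITIAL_STATE)
```

The `deepcopy` is the load-bearing part: a shallow copy would still share the inner `messages` list between conversations.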
4. The Cost Spiral
A complex query that triggers cascading agent calls, burning through tokens. Solution: Per-request cost budgets with automatic circuit breaking.
```python
class CostTracker:
    def __init__(self, max_budget_cents: int = 50):
        self.max_budget = max_budget_cents
        self.current_spend = 0

    def track(self, model: str, tokens: int):
        cost = calculate_cost(model, tokens)
        self.current_spend += cost
        if self.current_spend > self.max_budget:
            raise BudgetExceededError(
                f"Request exceeded budget: ${self.current_spend/100:.2f}"
            )
```

Observability Is Non-Negotiable
You cannot debug multi-agent systems without proper observability. At minimum, you need:
- Trace IDs that follow a request through all agents
- Token usage per agent per request
- Latency breakdowns (which agent is the bottleneck?)
- Decision logs (why did the router choose this specialist?)
- Error rates per agent (is one specialist failing more than others?)
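The first item on that list costs almost nothing to implement: attach a trace ID when the request enters the system, and make every agent log it alongside its decisions. A sketch (helper names are illustrative):

```python
import logging
import uuid

logger = logging.getLogger("agents")

def start_request(state: dict) -> dict:
    """Attach a trace ID that follows the request through every agent."""
    return {**state, "trace_id": str(uuid.uuid4())}

def log_decision(state: dict, agent: str, decision: str) -> None:
    """One structured line per decision; grep by trace= to replay a request."""
    logger.info("trace=%s agent=%s decision=%s", state["trace_id"], agent, decision)
```

With the trace ID in every log line, reconstructing "which specialist got this request and why" becomes a single filter instead of an archaeology project.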
We use LangSmith for tracing and custom Prometheus metrics for monitoring. The dashboard shows real-time agent performance and automatically alerts when error rates exceed thresholds.
Key Takeaways
- Start with two agents (router + one specialist), not ten. Add complexity only when data proves you need it.
- Each agent should have a single, clear responsibility. If you can't describe what an agent does in one sentence, it's doing too much.
- Invest heavily in the router. A good router makes average specialists work well. A bad router makes excellent specialists useless.
- Budget for observability from day one. You'll spend more time debugging than building.
- Always have a human escalation path. The goal is augmenting humans, not replacing them.
In my next post, I'll dive deeper into RAG pipeline optimization — specifically, how we achieved sub-200ms retrieval latency while maintaining 95%+ relevance for medical queries.