Multi-Agent Systems in Production: Patterns That Actually Work
After building and deploying multi-agent systems that handle thousands of requests daily in healthcare and customer support, I've developed strong opinions about what works and what doesn't. This post covers the patterns that survived contact with production reality.
The Problem with Single-Agent Architectures
Most tutorials show you a single LLM agent with a bag of tools. This works great for demos. In production, it falls apart because:
- Context window pollution: One agent handling everything means every tool description, every piece of context, competes for attention
- No separation of concerns: A billing question and a technical support question require completely different context, tools, and guardrails
- Debugging nightmares: When something goes wrong, you're searching through one massive chain of thought
- Inconsistent quality: An agent that's great at classification might be terrible at generation, and vice versa
Pattern 1: The Router-Specialist Architecture
This is the pattern I use most. A lightweight router agent classifies the request and delegates to specialized agents.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class OrchestratorState(TypedDict):
    messages: list
    classification: str
    specialist_response: str
    confidence: float
    iteration: int

def router(state: OrchestratorState) -> OrchestratorState:
    """Lightweight classifier: fast model, minimal prompt."""
    classification = classify_with_gpt4_mini(
        state["messages"][-1],
        categories=["medical_query", "billing", "scheduling", "general"],
    )
    return {
        **state,
        "classification": classification.category,
        "confidence": classification.confidence,
    }

def route_to_specialist(state: OrchestratorState) -> str:
    """Deterministic routing based on classification."""
    if state["confidence"] < 0.7:
        return "clarification_agent"
    return f"{state['classification']}_agent"

graph = StateGraph(OrchestratorState)
graph.add_node("router", router)
graph.add_node("medical_query_agent", medical_agent)
graph.add_node("billing_agent", billing_agent)
graph.add_node("scheduling_agent", scheduling_agent)
graph.add_node("general_agent", general_agent)
graph.add_node("clarification_agent", clarification_agent)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route_to_specialist)
```

Why this works:
- The router uses a fast, cheap model (GPT-4 mini or Claude Haiku)
- Each specialist has its own tailored prompt, tools, and guardrails
- Adding a new category means adding one specialist — no changes to existing agents
- Classification confidence below threshold triggers clarification instead of guessing
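The `classify_with_gpt4_mini` helper is glue code I left out above. The model call itself is a routine API request; the part worth getting right is handling the reply defensively. Here is a minimal sketch of the parsing half, assuming the classifier is prompted to return JSON with `category` and `confidence` fields (the `Classification` shape and the fallback behavior are illustrative choices, not a fixed contract):

```python
import json
from dataclasses import dataclass

@dataclass
class Classification:
    category: str
    confidence: float

def parse_classification(raw: str, categories: list[str]) -> Classification:
    """Parse the model's JSON reply; fall back to a low-confidence
    'general' label so the router never crashes on malformed output."""
    try:
        data = json.loads(raw)
        category = data["category"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return Classification("general", 0.0)
    if category not in categories:
        return Classification("general", 0.0)
    # Clamp to [0, 1]: models occasionally return confidences outside range
    return Classification(category, min(max(confidence, 0.0), 1.0))
```

The fallback matters: a confidence of 0.0 on any parse failure means the low-confidence branch routes to the clarification agent instead of guessing.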
Pattern 2: The Pipeline Architecture
For tasks with sequential stages, I use a pipeline where each agent's output feeds into the next.
[Document Ingestion] → [Entity Extraction] → [Analysis] → [Report Generation]
Real-world example: Medical Report Analysis
In our healthcare platform, when a doctor uploads a patient report:
- Extraction Agent: Pulls structured data (lab values, medications, diagnoses) from unstructured text
- Validation Agent: Cross-references extracted data against medical ontologies (ICD-10, SNOMED)
- Analysis Agent: Identifies patterns, flags abnormalities, suggests differential diagnoses
- Report Agent: Generates a structured clinical summary for the doctor
```python
def extraction_node(state):
    """Extract structured medical entities from raw text."""
    extracted = medical_ner_chain.invoke({
        "report": state["raw_report"],
        "extraction_schema": MEDICAL_ENTITY_SCHEMA,
    })
    return {**state, "entities": extracted, "stage": "extraction_complete"}

def validation_node(state):
    """Validate extracted entities against medical ontologies."""
    validated = []
    for entity in state["entities"]:
        is_valid = validate_against_ontology(entity)
        validated.append({**entity, "validated": is_valid})
    return {**state, "entities": validated, "stage": "validation_complete"}
```

Key insight: Each agent in the pipeline can use a different model. The extraction agent might use a fine-tuned model, while the analysis agent uses GPT-4 with a carefully crafted prompt.
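In LangGraph the wiring is a chain of `add_edge` calls, but conceptually the pipeline is just a fold of state through the stages in order. A runnable sketch with toy stand-in stages (the stage bodies here are placeholders, not the real chains):

```python
from functools import reduce

def run_pipeline(state: dict, stages) -> dict:
    """Fold the state dict through each stage in order."""
    return reduce(lambda s, stage: stage(s), stages, state)

# Toy stages standing in for the real extraction/validation nodes.
def fake_extract(state):
    return {**state, "entities": [{"term": "metformin"}], "stage": "extraction_complete"}

def fake_validate(state):
    return {
        **state,
        "entities": [{**e, "validated": True} for e in state["entities"]],
        "stage": "validation_complete",
    }

result = run_pipeline({"raw_report": "..."}, [fake_extract, fake_validate])
```

Because every stage takes and returns the same state shape, you can unit-test each one in isolation and reorder or insert stages without touching the others.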
Pattern 3: The Debate Architecture
For high-stakes decisions, I have multiple agents independently analyze the same input; a synthesis agent then resolves any disagreements.
This pattern is critical in our medical AI system where diagnostic accuracy matters:
```python
def parallel_analysis(state):
    """Run multiple diagnostic agents on the same input."""
    analyses = []
    # Sequential here for clarity; in production these calls run concurrently.
    for agent in [conservative_agent, aggressive_agent, specialist_agent]:
        result = agent.invoke(state["patient_data"])
        analyses.append(result)
    return {**state, "analyses": analyses}

def synthesis(state):
    """Resolve disagreements between agents."""
    analyses = state["analyses"]
    # If all agents agree, high confidence
    if all_agree(analyses):
        return {**state, "final_diagnosis": analyses[0], "confidence": 0.95}
    # If disagreement, present options with reasoning
    return {
        **state,
        "final_diagnosis": synthesize_with_reasoning(analyses),
        "confidence": calculate_agreement_score(analyses),
        "requires_human_review": True,
    }
```

When to use this: Only for decisions where being wrong has serious consequences. It's expensive (3x the LLM calls) but catches errors that single-agent systems miss.
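`all_agree` and `calculate_agreement_score` are simple once you decide what "agreement" means. Here is one plausible definition, assuming each analysis reduces to a single diagnosis label (real outputs need a normalization step first, which I'm skipping):

```python
from collections import Counter

def all_agree(analyses: list[str]) -> bool:
    """True when every agent returned the same diagnosis label."""
    return len(set(analyses)) == 1

def calculate_agreement_score(analyses: list[str]) -> float:
    """Fraction of agents backing the most common diagnosis."""
    if not analyses:
        return 0.0
    _, count = Counter(analyses).most_common(1)[0]
    return count / len(analyses)
```

With three agents the score can only be 1/3, 2/3, or 1.0, which maps cleanly onto "review required", "present with caveats", and "high confidence".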
Pattern 4: Human-in-the-Loop Escalation
Every production multi-agent system needs a graceful path to human intervention. Here's how I implement it:
```python
ESCALATION_TRIGGERS = {
    "low_confidence": lambda state: state["confidence"] < 0.6,
    "sensitive_topic": lambda state: any(
        t in state["classification"] for t in ["legal", "emergency", "complaint"]
    ),
    "repeated_clarification": lambda state: state["clarification_count"] > 2,
    "user_request": lambda state: "speak to human" in state["messages"][-1].lower(),
}

def should_escalate(state) -> bool:
    return any(check(state) for check in ESCALATION_TRIGGERS.values())
```

The escalation should include:
- Full conversation history
- Agent classification and confidence scores
- Attempted solutions and why they failed
- Suggested actions for the human operator
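Packaging those four items is mechanical; the point is to do it in one place so no handoff ever arrives without context. A sketch (the state keys here are illustrative, match them to your own schema):

```python
def build_escalation_payload(state: dict) -> dict:
    """Bundle everything a human operator needs into one handoff object."""
    return {
        "conversation": state["messages"],
        "classification": state.get("classification"),
        "confidence": state.get("confidence"),
        "attempted_solutions": state.get("attempts", []),
        "suggested_actions": state.get("suggestions", []),
    }
```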
Failure Modes I've Encountered
1. The Infinite Loop
An agent that keeps calling tools without making progress. Solution: Hard iteration limits with graceful degradation.
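The guard itself is a few lines; the cap of 8 below is arbitrary and the degraded response is a placeholder, but the shape is what matters: past the cap, return something useful and flag for review instead of making another tool call.

```python
MAX_ITERATIONS = 8  # arbitrary cap; tune per workload

def step_or_degrade(state: dict) -> dict:
    """Advance the loop, or degrade gracefully once the cap is hit."""
    if state["iteration"] >= MAX_ITERATIONS:
        return {
            **state,
            "final_response": "I couldn't complete this automatically.",
            "requires_human_review": True,
        }
    return {**state, "iteration": state["iteration"] + 1}
```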
2. The Confident Wrong Answer
An agent that produces a plausible-sounding but incorrect response with high confidence. Solution: The debate pattern, plus always showing confidence scores to end users.
3. The Context Leak
Information from one conversation leaking into another due to shared state. Solution: Strict state isolation per conversation, no global mutable state.
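The fix is easy to state and easy to forget: every conversation gets its own state object, never a shared module-level one. A minimal sketch of the pattern:

```python
import copy

# Template only; never mutated directly.
INITIAL_STATE = {"messages": [], "classification": "", "confidence": 0.0, "iteration": 0}

def new_conversation_state() -> dict:
    """Deep-copy the template so conversations never share mutable objects."""
    return copy.deepcopy(INITIAL_STATE)
```

The `deepcopy` is the load-bearing part: a shallow copy would still share the inner `messages` list between conversations.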
4. The Cost Spiral
A complex query that triggers cascading agent calls, burning through tokens. Solution: Per-request cost budgets with automatic circuit breaking.
```python
class CostTracker:
    def __init__(self, max_budget_cents: int = 50):
        self.max_budget = max_budget_cents
        self.current_spend = 0

    def track(self, model: str, tokens: int):
        cost = calculate_cost(model, tokens)
        self.current_spend += cost
        if self.current_spend > self.max_budget:
            raise BudgetExceededError(
                f"Request exceeded budget: ${self.current_spend/100:.2f}"
            )
```

Observability Is Non-Negotiable
You cannot debug multi-agent systems without proper observability. At minimum, you need:
- Trace IDs that follow a request through all agents
- Token usage per agent per request
- Latency breakdowns (which agent is the bottleneck?)
- Decision logs (why did the router choose this specialist?)
- Error rates per agent (is one specialist failing more than others?)
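The first item on that list costs almost nothing to implement: attach a trace ID when the request enters the system, and make every agent log it alongside its decisions. A sketch (helper names are illustrative):

```python
import logging
import uuid

logger = logging.getLogger("agents")

def start_request(state: dict) -> dict:
    """Attach a trace ID that follows the request through every agent."""
    return {**state, "trace_id": str(uuid.uuid4())}

def log_decision(state: dict, agent: str, decision: str) -> None:
    """One structured line per decision; grep by trace= to replay a request."""
    logger.info("trace=%s agent=%s decision=%s", state["trace_id"], agent, decision)
```

With the trace ID in every log line, reconstructing "which specialist got this request and why" becomes a single filter instead of an archaeology project.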
We use LangSmith for tracing and custom Prometheus metrics for monitoring. The dashboard shows real-time agent performance and automatically alerts when error rates exceed thresholds.
Key Takeaways
- Start with two agents (router + one specialist), not ten. Add complexity only when data proves you need it.
- Each agent should have a single, clear responsibility. If you can't describe what an agent does in one sentence, it's doing too much.
- Invest heavily in the router. A good router makes average specialists work well. A bad router makes excellent specialists useless.
- Budget for observability from day one. You'll spend more time debugging than building.
- Always have a human escalation path. The goal is augmenting humans, not replacing them.
In my next post, I'll dive deeper into RAG pipeline optimization — specifically, how we achieved sub-200ms retrieval latency while maintaining 95%+ relevance for medical queries.